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Abstract — This paper presents a systematic metliodology based 
on the algebraic theory of signal processing to classify and derive 
fast algorithms for linear transforms. Instead of manipulating 
the entries of transform matrices, our approach derives the algo- 
rithms by stepwise decomposition of the associated signal models, 
or polynomial algebras. This decomposition is based on two 
generic methods or algebraic principles that generalize the well- 
known Cooley-lXikey FFT and make the algorithms' derivations 
concise and transparent. Application to the 16 discrete cosine 
and sine transforms yields a large class of fast algorithms, many 
of which have not been found before. 

Index Terms — Fast Fourier transform, discrete Fourier trans- 
. form, discrete cosine transform, DFT, DCT, DST, polynomial 
algebra, representation theory 
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I. Introduction 

In [1], [2], [3], we have proposed a new approach to linear 
signal processing (henceforth just referred to as signal pro- 
cessing or SP), called algebaric signal processing theory. The 
approach argues that the assumptions underlying SP provide 
structure that includes but goes beyond vector spaces and 
linear algebra and places SP more naturally into the context 
of the theory of algebras and modules, or the representation 
theory of algebras. 

In recognizing this structure, we have introduced a general, 
axiomatic approach to SP that starts from the concept of a 
signal model. Given a signal model, all major SP ingredients 
can be derived from it, including signals, filters, convolu- 
tion, associated "z-transform," spectrum, Fourier transform, 
and frequency response, among others. These concepts take 
different forms for different models, as shown in Table I, 
which is explained in detail later in Section II. For example. 
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discrete infinite and finite (finite number of samples) 1-D time 
are signal models with associated 2-transform and finite z- 
transform (defined in [1]) and the DTFT and DFT as associated 
Fourier transforms, respectively. Further, we developed signal 
models for infinite and finite 1-D space (where space is in the 
sense of an undirected graph versus directed graph for time, 
and not as 2-D versus 1-D) and showed that for the latter 
there are 16 reasonable alternatives corresponding to 16 finite 
C-transforms (defined in [2]) and showed that the 16 discrete 
cosine and sine transform (DCTs and DSTs) are the associated 
Fourier transforms. First results on higher-dimensional SP are 
also akeady available [4], [5], [6], [7], [8]. 

The algebraic theory provides a methodology for the con- 
struction of finite signal models and clarifies the role played 
by boundary conditions and their relation to signal extensions. 
In particular, we showed that any finite shift-invariant signal 
model is described by a polynomial algebra, which captures 
all the necessary information about the model. We derived 
and discussed in detail the polynomial algebras for the DFT, 
DCTs, DSTs, and most other known, as well as some new 
trigonometric transforms [2]. 

Algebraic theory of transform algorithms. In this paper, 
we apply the algebraic approach to the derivation and discov- 
ery of Fourier transform algorithms. Here, the term Fourier 
transform is meant in the general sense of the algebraic theory, 
i.e., including DFT, DFTs, DSTs, and other trigonometric hke 
transforms. In other words, we apply the algebraic SP theory 
to derive fast transform algorithms. The paper extends the 
preliminary results shown in [9], [10]. 

There is a large body of literature on fast transform al- 
gorithms. With very few exceptions (for example DFT algo- 
rithms, discussed below) these algorithms are derived by clever 
and often lengthy manipulation of the transform coefficients. 
This is hard to grasp, and provides no insight into the structure 
or the derivation of the algorithm. Further, without an appro- 
priate theory, it is hard to determine if all relevant classes 
of algorithms have been found. This is not just an academic 
problem as the variety of different implementation platforms 
and appUcation requirements makes a thorough knowledge of 
the algorithm space crucial. 

Our derivation of the fast algorithms is algebraic: we ma- 
nipulate the signal model (or polynomial algebra) underlying 
a transform rather than the transform itself. We present two 
generic theorems for polynomial algebras that generahze the 
Cooley-Tukey FFT [11]. Application to the 16 DCTs and 
DSTs yields a large set of Cooley-Tukey type algorithms, most 
of which have not been found with previous methods. The al- 
gorithm derivation is concise (no tedious index manipulations) 
and greatly simplified as there is a clear methodology. We 
draw attention to the large number of algorithms in this paper. 
However, we do not consider all existing classes of algorithms. 
In particular, all our algorithms are non-orthogonal, i.e., they 
are not built from rotations. More precisely, in this paper, we 
will not consider orthogonal algorithms (e.g., [12]), algorithms 
that compute DFTs via the DCTs/DSTs [13], [14], prime- 
factor type algorithms [15], Rader-type algorithms [16], [17], 
[18], or algorithms that do not reduced the operations count 
[19]. The algebraic principles behind some of these algorithms 



will be the subject of a future paper. 

Goal of this paper. This paper has two main goals: First, 
to explain how and why algorithms arise and how they can 
be derived in a reproducible way. Second, this paper can 
serve as a reference for readers whose interest is solely in 
the algorithms, for example, for their implementation. For this 
reason, all algorithms are presented in tables and in a form 
from which they can be easily retrieved. 

Previous work. The approach taken in this paper to derive 
algorithms using polynomial algebras builds on and extends 
early work on DFT algorithms. The known interpretation of 
the DFT„ in terms of the polynomial algebra C[2:]/ (x" — 1) 
was used to derive and explain the (radix-2) Cooley-Tukey 
FFT by Auslander, Feig, and Winograd [20] using the Chi- 
nese remainder theorem (CRT). Equivalently, Nicholson [21] 
explains DFT and FFT using group theory; so does Beth [22], 
which generahzed the approach to more general groups. Wino- 
grad's DFT algorithms [23], [24], [25], [26] and his results 
in complexity theory make heavy use of polynomial algebras 
and the CRT. So do extensions of the above work by Burrus 
et al. [27], [28]. Nussbaumer [29], [30], [31] uses polynomial 
algebras and the CRT to derive efficient 2-D FFTs that save 
multiphcations compared to the row-column method. 

For the DFT it turns out that to derive the most important 
FFTs, it is not necessary to work with polynomial algebras, 
but sufficient to work with index arithmetic modulo n. This 
approach is used in [31], [32] to provide a consistent approach 
to FFTs. However, this approach provides no insight into how 
to approach other transforms, whereas the polynomial algebra 
approach does, as we show in this paper. Further, this approach 
fits naturally with the algebraic SP theory, since polynomial 
algebras are a natural structure from an SP point of view as 
explained in [1]. 

The only (implicit) use of polynomial algebras for the DCTs 
or DSTs we found in the literature is the derivation of a DCT, 
type 3, algorithm by Steidl [33], [34]. These papers provided 
important hints for developing the work in this paper. 

Organization of the paper. Section II provides a brief 
introduction to the algebraic signal processing theory. Most 
relevant are the signal models, or polynomial algebras, asso- 
ciated with the DFT and DTTs. Section III introduces notation 
to represent algorithms as products of structured matrices. Two 
algebraic methods to derive algorithms from a polynomial 
algebra are explained in Section IV using the DFT as an 
example. Then we apply these methods to derive numerous 
Cooley-Tukey type algorithms for the DTTs in Sections V- 
X. A visual organization of the most important ones can be 
found in Figure 2 in Section VI. Finally, we offer conclusions 
in Section XI. 

II. Background: Algebraic Signal Processing 

Theory 

The algebraic signal processing theory recognizes that the 
structure available in linear signal processing (heretofore, 
simply signal processing or SP) goes beyond vector spaces 
(or hnear spaces) and is actually described by algebras and 
associated modules, which places SP in the context of (ab- 
stract) algebra. The algebraic theory provides a consistent and 
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TABLE I 

1 -D DISCRETE INFINITE AND FINITE TIME AND SPACE SIGNAL PROCESSING AS FOUR INSTANTIATIONS OF THE GENERAL ALGEBRAIC SIGNAL 

PROCESSING THEORY. 



iniiiiite time finite time infinite space finite space generic tlieory 



series in z~" 


polynomials in z~" 


series in T„ 


polynomials in T„ 


A (algebra of filters) 


series in z~" 


polynomials in z~" 


series in C„ 


polynomials in Cn 


M (^-module of signals) 


z-transform 


finite z-transform(s) 


C-transform(s) 


finite C-transform(s) 


<I> ("z-transform") 


DTFT 


DFTs 


DSFTs 


DCTs/DSTs 


(Fourier transform) 



generic framework for SP whose instantiations lead to many 
known, as well as new, ways of doing SP. 

The key concept in the algebraic theory is the signal model 
{A,A4,^). Before we define it, we introduce two algebraic 
terms: algebra and associated module, which model the spaces 
of filters and signals, respectively. 

Algebra = Space of filters. An algebra is a vector space that 
is also a ring, i.e., it has a defined multiplication of elements, 
such that the distributivity law holds. Examples of algebras 
include C (complex numbers) and C[x] (set of polynomials 
with complex coefficients). 

The crucial observation is that the set of filters in a given 
signal processing scenario (e.g., infinite discrete time) is 
usually assumed to be an algebra. Namely, the multiplication 
is the concatenation of filters. For example, in infinite discrete 
time, the set of filters (in the z-domain) is the algebra 

^ = {/i(^-i) = ^/i„^-"}, (1) 

where, for example, the coefficient sequences 
(. . . , h-i,ho, hi, . . .) e i.e., are absolutely summable. 

Module = Space of signals. Assume an algebra A is 
chosen. Then an ^-module is a vector space that permits an 
operation of elements of ^ on through linear mappings. 
This operation is the algebraic analogue of filtering in SP. 
Formally, if h G A, then there is an operation (written as 
multiplication) 

h : A4 ^ A4, m h ■ m, 

which is linear, i.e., h{s + s') = hs + hs', and h{as) = a{hs) 
for s,s' G M and a e C. 

An example of an ^-module is M = A itself with the 
operation being the multiplication in A. When M. = A, M. 
is called a regular module. 

The above properties capture exactly the structure of the 
signal space: every filter is a linear mapping on the signal 
space. The ^-module usually chosen along with A in (1) in 
infinite discrete-time signal processing is 

7W = {s = s(z-i)=^s„z-"}, (2) 

where the coefficient sequences are in ^^(Z), i.e., of finite 
energy. 

Signal model. We start with the formal definition consid- 
ering infinite and finite discrete complex signals s e over 
some index domain I. Examples include s e or s e C". 



Definition 1 (Signal model) Let < be a discrete vector 
space. A signal model for F is a triple {A,M,<^), where A is 
an algebra, is an associated ^-module, and $ is a bijective 
linear mapping 

^ : V ^ M, s s e M. 
We call a signal model regular if M. = A. 

An example is the signal model commonly adopted for 
infinite discrete-time signal processing. Namely, A is defined 
as in (1), M. as in (2), and is the 2;-transform 

$ : s s = ^ Sn-z~" e M. 

{A,M,^) is a signal model for V = £^{Z). 

The purpose of the signal model is to assign a proper notion 
of filtering to a discrete sequence s, which, taken by itself, does 
not specify how this should be done. Once a signal model is 
selected, all main concepts for SP can be derived: filtering or 
convolution (operation of A on M), associated "z-transform" 
($), spectrum, frequency response, Fourier transform, and 
others. 

The question now is, which signal models are used or make 
sense in SP. A partial answer was provided in [1], [2]: if 
shift-invariance is required, A has to be commutative. Further, 
we have shown how to derive models from basic principles, 
through a suitable definition of the shift operator. Using this 
method, we presented in [2] a theory of 1-D space SP. 

The algebraic theory provides a comprehensive theory for 
finite signal models, i.e., models for finite sequences s G C". 
In particular, it identifies for all trigonometric transforms T 
the associated signal models, i.e., those that have T as Fourier 
transform, and explain how they are obtained. 

In particular, all signal models associated to the trigonomet- 
ric transforms are finite, shift-invariant, and most of them are 
regular, i.e., A = M. The only way to obtain such models is 
through polynomial algebras as we explain next. 

A. Finite Shift-Invariant Regular 1-D Signal Models 

If {A,A4,^) is a shift-invariant signal model for finite 
1-D sequences s = {sq, ■ ■ ■ , Sn-i), then, necessarily, A = 
C[x]/p{x) is a polynomial algebra with a suitable polynomial 
p{x). It is defined as 

C[x]/p{x) = {q{x) I deg(g') < deg(p)}. 
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In words, given p{x), C[x]/p{x) is the set of all polynomials of 
degree smaller than p with addition and multipUcation modulo 
p. If deg{p) = n, then dim(C[x]/p(x)) = n. 

In this paper, we restrict ourselves to regular models, i.e., 
models with M. = A = €,[x\/p{x). With this restriction, a 
signal model for F = C" is uniquely characterized by p{x) 
and by a chosen basis h = (po, • • • ,Pn-i) of -M.. Namely, $ 
is given by 

V^£[x\/p{x), SH^s= ^ SiPi. (3) 

0<t<n 

By construction, $ is bijective. Conversely, every finite shift- 
invariant regular 1-D signal model can be expressed this way. 
Filtering in these models is equivalent to multiplying two 
polynomials (signal and filter) modulo the fixed polynomial 
p{x). Note that (3) clarifies the role of the z-transform in SP: 
the equation shows that the bijective map which generalizes 
the ^-transform, is simply an artifact to fix the basis of the 
signal module M. 

Example: finite z-transform. As an example consider 
the model A = M = C[x]/{x^ - 1) with basis b = 
(a;°, . . . , a;"-i) in M and thus, for s = (sq, • • • , s^-if € C", 

$ : s^s = s{x) = Skx'' e C[a;]/(a;" - 1) (4) 

0<fe<n 

is the finite z-transform. After applying the model, filtering is 
defined, for h = h{x) e A and s = s{x) G M as 

h{x)s{x) mod (a;" - 1), 

which is equivalent to computing the circular convolution of 
the coefficient sequences h and s (e.g., [31]). 

Fourier transform. The Fourier transform for signal mod- 
els of the form {C[x]/p{x),C[x]/p{x),^) is obtained from 
the well-known Chinese remainder theorem (or CRT, see 
Appendix I). We assume that the zeros of p{x) are pairwise 
distinct, given by a = {ao, . . . , q:„_i). Then the CRT provides 
the decomposition 

^: C[x]/p(x) ^ eo<fc<„C[.T]/(x-afe), 
s{x) (s(ao), . . . , s(a„_i)). 

The mapping !F is the Fourier transform for the signal model, 
and C[a;]/(a; — ak), < < n, are the spectral components 
ofM= C[x]/p{x). 

To obtain a matrix representation of !F, we choose bases. 
The basis b = (po, • • • ,Pn-i) of A4 is provided by the 
model (namely by <&). In each spectral component, which has 
dimension 1, we choose the basis (a;°). The standard procedure 
to derive the matrix representation for is apply T to the base 
vectors pe, determine the coordinate vectors of the images, and 
place them in the columns of a matrix. By abuse of notation, 
we denote this matrix also by J^. Because 

pe{x) = pe{ak) mod {x - ak), 

we obtain 

^ = VbM = [Pl{ak)]Q<k,l<n- 

We call Vb^a a polynomial transform. It is uniquely deter- 
mined by the signal model. This definition is different from 
Nussbaumer's in [29], [31]. 



Other Fourier transforms for the same model arise through 
the degrees of freedom in choosing the bases in the spectral 
components C[x]/(x — ak). In the most general case, we 
choose a basis {l3kX^) in each component, which yields the 
generic Fourier transform. 

r = diag(l//3o, . . . , l//?„-i)n,a. (6) 

Returning to our previous example A = M. = C[a;]/(a;" — 
1) and $ given in (4), we compute 

C[x\l{x--l)^ C[x\/{x-ujI), 

0<k<n 

where a;„ = e~^'^^/", and thus 

'Pb.a = ['^n%<k,e<n = DFT„ 

is the discrete Fourier transform, which also motivates the 
name finite ^-transform for (4). 

B. Signal Models for DFTs and DTTs 

In this section we provide the signal models for 4 types 
of DFTs, the 16 DCTs and DSTs. We refer to the DCTs and 
DSTs collectively as DTTs (discrete trigonometric transforms) 
even though this class is actually larger (e.g., including discrete 
Hartley transform and real discrete Fourier transforms). Fur- 
ther, we define 4 types of skew DTTs, which were introduced 
in [2], and which are necessary to derive a complete set of 
algorithms. 

Each of these transforms is a Fourier transform for a finite 
shift-invariant regular 1-D signal model. As said before, these 
models are uniquely determined by p{x) (defining A = 
A4 = C[x]/p{x)) and the basis b (defining $). The model 
in turn uniquely determines the associated polynomial Fourier 
transform Vb.a - To characterize an arbitrary Fourier transform, 
we need to specify in addition the diagonal matrix in (6). We 
do this in the following by providing a function / such that 
the diagonal matrix is given by 

Df = diago<^<„(/(a£)), 

where a£ are, as before, the zeros of p{x). 

Due to lack of space, we will not provide in the paper 
detailed derivations of the signal models; we refer the reader 
to [1], [2] for details. 

DFTs. The DFTs are Fourier transforms for finite time 
models. We distinguish 4 types, DFT type 1^. Type 1 and 3 
are special cases of a DFT(a), a e C\{0}, all of which are 
polynomial transforms. 

For example, the signal model associated to DFT (a) is 
given by ^ = = C[a;]/(a;" — a) and $ : s i-^ 
J2o<k<n ^kx'^- The zeros of cc" — a are the n nth roots of 
a and thus straightforward computation yields 

DFT(a) = Vb,a = DFT„ diago<^<„( (7) 

where ^ = |a|i/"e'^J/" for a = |a|e^J. 

DTTs. The 16 DTTs are Fourier transforms for finite space 
models, which are defined in Table III. In contrast to the time 
models, the basis polynomials are now Chebyshev polynomials 
of the first (Tfe), second (Uk), third (Vk), or fourth (Wk) kind. 
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TABLE II 

Signal models associated to the DFTs. 



T 


■p{x) 


b 


f = f{t) 


DFT = DFT-1 




- 1 




1 


DFT-2 




- 1 




1/2 


DFT-3 




+ 1 


X*' 


1 


DFT-4 


a;" 


+ 1 




1/2 

a/ 


DFT(a) 


3;" 


— a 


x'^ 


1 



TABLE III 

Signal models associated to the 16 DTTs (DCTs and DSTs). 



o 



r 


P = P{x) 




b 


/ = f{0), cosO = ai 


DCT-3 


T„ 




n 


1 


DST-3 


T„ 




Uk 


sin(6') 


DCT-4 


T„ 




Vk 


cos(e/2) 


DST-4 


T„ 




Wk 


sin(6»/2) 


DCT-1 


(x2 - !)[/„. 


-2 


Tk 


1 


DST-I 


[/„ 




Uk 


sin(6l) 


DCT-2 


(a; - l)U„- 


1 


Vk 


cos(0/2) 


DST-2 


{x + l)Un- 


1 


Wk 


sin(6l/2) 


DCT-7 


{x + i)y„_ 


1 


Tk 


1 


DST-7 


v„ 




Uk 


sm(6>) 


DCT-8 


v„ 




Vk 


cos(6l/2) 


DST-8 


(x + 1)V„- 


1 


Wk 


sm(6'/2) 


DCT-5 


(x - 1)W„^ 


-1 


Tk 


1 


DST-5 


Wn 




Uk 


sin(9) 


DCT-6 


{X - l)Wn- 


-1 


Vk 


cos(e/2) 


DST-6 


Wn 




Wk 


sin(6»/2) 



See Appendix II for their definition and properties that we will 
use in this paper. 

As an example consider the most commonly used DCT-2„. 
The associated model is given from Table III by ^ = = 
C[x\/{x — l)C/„_i. The zeros of {x — l)f/„_i are given 
by a/c = cos(fc7r/n), < fc < n (see Table XXIII in 
Appendix II). Thus the unique polynomial Fourier transform 
for the model is given by 



cos 



(fc+l/2)7r 



■Pb.a = \ye.{Oik)]Q<k,l<n 

J 0<k,e<n 

Multiplying Vb,a from the left by the scaling diagonal 

diago<fe<„(cos(acos(afe)/2)) 



(8) 



cancels the denominator to yield 



DCT-2„ = [cos 



fc(^+l/2)71 



JO<fc,^<rn 



which identifies DCT-2 as a Fourier transform for the speci- 
fied signal model. 

The definitions of all 16 DTTs are given in Table IV. Types 
I, 4, 5, 8 are symmetric; types 2, 3 and 6, 7 are transposes of 
each other, respectively. 

Every DTT has a polynomial transform counterpart, which 
we write as DTT. For example DCT-2„ is the matrix in (8). 



TABLE IV 

> types of DCTs and DSTs OF SIZE n. The entry at row k and 

COLUMN £ IS GIVEN FOR Q < k,i < n. 



type 



DCTs 



DSTs 



1 cos 

3 cos(fc + |)£^ 

4 cos(fe + l)(r+i)^ 

5 cos 

6 cosk(e+ ^)-^ 

^' n— 

7 COs(fc-|- 

8 cos(fc+i)(£+i)-^ 



sin(fc-t-l)(€ + l)^ 

sm{fc + l){^ + l)^ 
sm(fe + i)(^ + l)^ 
sm(fc + i)(£+i)f 
sin(fe + l)(^ + l)--^ 

sm(A: + l)(£+i)^ 

sm{k + ^){£ + l)^ 

sin(fc-hi)(€+i)-4 



TABLE V 

4 TYPES OF SKEW DTTS AND ASSOCIATED SIGNAL MODELS. THE 
PARAMETERS IS IN < r < 1. FOR r = 1/2 THEY REDUCE TO THE 
T-GROUPDTTS. 



J" 


P 


= p{x) 


b 


f = f{e), cose = at. 


DCT-3{r) 


T„ 


— COS rn 


Tk 


1 


DST-3(r) 


Tn 


— cos rn 


Uk 


sin(e) 


DCT-4{r) 


Tn 


— cos rn 


Vk 


cos(6»/2) 


DST-4(r) 


Tn 


— cos rn 


Wk 


sin(6»/2) 



For the DCTs of types 1,3,5,7, the scaling function is 1 (Ta- 
ble III) and thus they are equal to their polynomial counterpart. 
We will later see that in some cases, the polynomial DTTs 
have a lower arithmetic cost than the corresponding DTTs, 
which makes them suitable choices in application, where the 
transform output is scaled. 

We divide the DTTs into 4 groups, called T-, U-, V-, and 
VF-group depending on p as shown in Table III. Within each 
group, the algebra and module are (almost) the same. This 
leads to sparse relationships between DTTs in one group as we 
have shown in [2]; examples we will use are in Appendix III. 

Further, within a group, DTTs are pairwise dual (they 
have flipped associated boundary conditions [2]), which means 
that they can be translated into each other without additional 
arithmetic operations (see (105) in Appendix III). 

Skew DTTs. We introduced the skew DTTs in [3] since 
their associated signal models are also reasonable space 
models, but, more importantly, because they are important 
building blocks of Cooley-Tukey type algorithms as we will 
show in this paper. There are 4 types of skew DTTs, each 
parameterized by < r < 1. They generaUze the four T- 
group DTTs (DCT/DST of type 3/4) and have the same scaling 
functions as these do. The models that define these transforms 
are shown in Table V. The corresponding polynomial versions 
are again denoted using a bar as in DCT-3„(r). 

To obtain the exact form of these transforms, we need the 

zeros of the polynomial T,j — cos ttt and choose an order of 
these zeros. This is done in the following lemma. 
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Lemma 1 Let < r < 1. We have the factorization 

T„-cosr7r = 2"-i JJ (a;-cos^7r), (9) 

0<i<ra 

which determines the zeros of r„ — cos m. We order the zeros 
as a = (cosroTT, . . . , cosr„_i7r), such that < r, < 1, 
and r, < rj for i < j. The list of the is given by the 
concatenation 

{n)o<e<n= U (^.^) 

0<j<n/2 

for n even, and by 

(r,)o<.<n=( U 

0<i<2iii 

for n odd. In the particular case of r = 1/2 or costtt = 0, 
we thus have a = (cos(f + l/2)7r/n)o<^<n as in Table XXIII 
in Appendix 11. 

For example, the DCT-3„(r) is given by the matrix 

DCT-3„(r) = [coskriTT]o<k,e<n, 

where the ri are provided by Lemma 1 . 

Relationships between the skew DTTs and skew and non- 
skew DTTs are shown in Appendix III. 

111. Background: Fast Transform Algorithms 

In this section, we explain the notation that we use to 
represent and manipulate transform algorithms followed by 
a brief discussion on the quaUty of algorithms. 

A. Representation of Algorithms 

We discuss two representations for transforms' and their 
algorithms. Traditionally, transforms in SP are written as 
sunmiation hke 

2/fc = XI *k,ese, (10) 
o<e<n 

where s = (sq, . . . , s„_i)-^ is the input signal, y = 
(yo, • • • , ?Ai-i)^ the output signal, and tfe,£ the transform 
coefficients. This representation is usually adopted because 
these transforms are thought of as truncated versions of infinite 
series expansions. Correspondingly, algorithms are written as 
sequences of such summations, cleverly organized to reduce 
the operations count. 

A different approach, equivalent in content, represents trans- 
forms as matrix-vector products 



y = Ts, where T = [tk^ 



kj\0<kj<n- 



(11) 



The transform matrix is T, and transform algorithms corre- 
spond to factorizations of T into a product of sparse structured 
matrices. This approach was adopted for the DFT in [35], [32], 
but also for other transforms in various research papers on fast 
transform algorithms. 

'By "transfonns," we mean here those computing some sort of spectrum 
of finite length discrete signals like the DFT or DTTs. 



In the algebraic signal processing theory, we adopt the 
second approach for two reasons. First, transforms (in a very 
general sense) arise in the theory as matrices, namely as 
decompositions of signal models (which includes a chosen 
basis) into its spectral components by base changes. More 
importantly, transform algorithms are derived in the algebraic 
theory through a decomposition of the model in steps, where 
the steps correspond to sparse base changes or sparse matrices. 

Second, we will argue below that there are many advan- 
tages of the matrix representation from an algorithmic and 
implementation point of view. 

Notation. As mentioned above, we represent transform 
algorithms as sparse structured matrix factorizations. These 
are built from basic matrices and operators. 

As basic matrices, we use the n x n identity matrix I„, the 
opposite identity matrix J„ (I„ with the columns in reversed 
order), and the butterfly matrix 



Fo = 



Further, we use permutation matrices; most importantly the 
n X n stride permutation matrix, which can be defined for 
m\n by 

IZ^: ii^+h^hm + i2, < h < ^, < i2 < m. 

(12) 

This definition shows that L^ transposes a ^ x m matrix 
stored in row-major order. Alternatively, we can write 

: i ^ im mod n — 1, for < i < n — 1, 

n — 1 n — 1. 

Since the last^point n — 1 is fixed, we can define an odd stride 
permutation L for m | n -|- 1 as the restriction of LJ^"*"^ to the 
first n points, 

L^ : im mod n. (13) 

Analogous to the stride permutation, (L^)~^ = L(„_|_i)^^, 
and ^ 

Other permutation matrices may be defined by their corre- 
sponding permutation 

P : f{i), 0<i<n, 

which means that the matrix P has in row i the entry 1 at 
position f{i) and else. In this paper, matrix indices start 
wifli 0. 

Diagonal matrices are written as diag(ao, . . . , Q!„_i). 
Other matrices that serve as building blocks will be defined 
as needed. 

Further we use matrix operators, like the product of matri- 
ces, the direct sum 



B 



and the Kronecker or tensor product 

A® B =[ak4B]k4, forA=[ak,(\. 
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Often, we will also construct a larger matrix as a matrix of 
matrices, e.g., 

\ A B ' 
B A ' 

Transposition and Inversion. If an algorithm for a trans- 
form is given as a product of sparse matrices built from 
the constructs above, then an algorithm for the transposed 
or inverse of the transform can be readily derived using 
mathematical properties including 

{ABf = B'^A^, {AB)-^ = B-'^A-\ 
{A®Bf ^ ®B^, {A®B)'^ ^A-^®B-\ 
{A® Bf = A^ ® B"^ , {A® B)-^ = A-^ ® B-^. 

Permutation matrices are orthogonal, i.e., = P^^ . The 
transposition or inversion of diagonal matrices is obvious. Note 
that in general the inverse of a sparse matrix is not sparse. 

Advantages of representation. We believe the structured 
representation of matrices to be advantageous because of the 
following reasons: 1) The representation is more visual and 
thus easier to grasp by human readers than nested sums 
with many indices. 2) Algorithms can be easier manipulated, 
e.g., transposed or inverted, using mathematical properties. 

3) Properties of the algorithm, e.g., orthogonality, or which 
parts can be computed in parallel, can be readily identified. 

4) Finally, the sparse matrix product representation can be 
automatically translated into programs using SPIRAL [36], 
[37]. 

Even though our approach simplifies the derivation of 
algorithms, the sheer number of matrices and cost formulas in 
the remainder of the paper makes it hard to assure correctness. 
We solved this problem using two computer algebra systems. 
Firstly, we used SPIRAL, which includes a modified version 
of GAP/AREP [38], [39], and provides infrastructure for 
working with structured matrices as shown here. This way, 
all formulas in the paper were formally verified (for several 
problem sizes). Secondly, we used Maple [40] to solve the 
numerous recurrences in our cost analysis. 

Quality of algorithms. There are many different factors 
that determine the quality of a given transform algorithm; 
the relative importance of these factors is determined by the 
chosen implementation platform and the specific requirements 
of the application context. While traditionally the arithmetic 
cost of transform algorithms is the focus of analysis, the 
characteristics of modern platforms make the algorithmic 
structure an equally important feature. Further, the numerical 
stability of an algorithm is important to ensure the accuracy 
of the output, in particular for fixed-point implementation. 
Because of this, the knowledge of the entire algorithm space 
for a transform is not just of academic interest. 

Aritlimetic cost. We will analyze the number of operations 
of the algorithms presented below; we will use the notation 
of a triple (a, m, m2), where a is the number of additions or 
subtractions, m2 the number of multiplications by a 2-power 
not equal to 1, and m the number of remaining multiplications 
by constants not equal to — 1. The total operations count is then 
given by f = a + m + m2. 



C[x]/p{x) 




Fig. 1. Basic idea behind the algebraic derivation of Cooley-Tukey type 
algorithms for a Fourier transform J^. 

In many SP publications the term complexity is used for the 
operations count or arithmetic cost . In a strict sense this is 
not correct, since complexity is a property of a problem (like 
computing a DFT), not of an algorithm (like a specific FFT). 
Thus we wiU use the term cost. 

IV. Algebraic Derivation of Fast Transform 
Algorithms for 1-D Polynomial Algebras 

In this section, we start with our algebraic theory of Fourier 
transform algorithms, where the term "Fourier transform" is 
meant in the general sense of the algebraic signal processing 
theory (e.g., including the DCTs, DSTs, and other trigonomet- 
ric transforms). As mentioned before, we consider only finite 
shift-invariant regular signal models, i.e., models of the form 
A = M = C[x]/p{x) and 

0<i<n 

where b = {po, ■ . ■ ,Pn-i) is a basis for M. Further, we 
assume that p has pairwise different zeros, which causes the 
spectrum to consist of distinct one-dimensional submodules. 
The Fourier transform in these cases is given by the CRT (5) 
and is viewed as a matrix (6). 

Assume a transform !F is given as a matrix. The algebraic 
approach derives algorithms by manipulating the associated 
signal model {A,M,^), not by manipulating the matrix 
entries of T. Fig. 1 visualizes this approach for A = Ai = 
C[x]/p{x). We saw in (6) that decomposes C[x]/p{x) 
into one-dimensional polynomial algebras: its spectrum. Fast 
algorithms arise, as we will show, by performing this decom- 
position in steps using an intermediate submodule and associ- 
ated subalgebra. This technique naturally leads to recursive 
algorithms, i.e., algorithms that decompose transforms into 
a product of sparse matrices including smaller transforms of 
the same or a different type. The advantage of the algebraic 
derivation is that it identifies a few general principles that 
account for many different algorithms when instantiated for 
different transforms. Further, the derivation is often greatly 
simplified, as we hope it will become clear, since the only 
task required is reading of base change matrices. 

In this paper, we focus on explaining and deriving, as we 
wiU call them, "Cooley-Tukey type" algorithms. As the name 
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suggests, these algorithms will include, as well as generalize, 
the equally named algorithms for the DFT. The latter will serve 
as examples in this section. Our main focus in the remainder of 
this paper will then be the derivation of analogous algorithms 
for the DCTs and DSTs, most of which have not been reported 
in the Uterature. All these new algorithms are non-orthogonal, 
i.e., are not constructed exclusively from butterflies and 2x2 
rotations. Orthogonal algorithms do exist and will be captured 
algebraically in a future paper. Also "Rader" type algorithms, 
which apply when the above decomposition methods fail (for 
the DFT in the case of a prime size), will be explained in a 
future paper. 

The existence and usefulness of algorithms for one of the 
above signal models rehes on both p{x) and b. Specifically, 
algorithms may arise from two different basic principles, 
which manifest themselves as a property of p: 

1) Cooley-Tukey type (factorization): p{x) = q{x) ■ r{x) 
factorizes; and 

2) Cooley-Tukey type (decomposition): p{x) = q{r{x)) de- 
composes. 

Clearly, 1) is always possible (if we consider the basefield C), 
but 2) is a special property of p. In each of these cases, as we 
will show, we obtain a matrix factorization of J^; its usefulness 
as a fast algorithm, however, depends on b. 

In the remainder of this section, we derive the general form 
of two types of recursive algorithms based on the above. In 
each case the algorithm is derived by a stepwise decomposition 
of = C[x]/p{x) with basis b. We focus on Fourier 
transforms that are polynomial transforms !F = Vb,a- Since 
general Fourier transforms have the form = D A\a,gVb,a, 
D a diagonal matrix, the results can be readily extended. 

A. Cooley-Tukey Type Algorithms: Factorization 

A simple way to decompose €\x]/p{x) in steps is to use a 
factorization p{x) = q{x)-r{x) of p. Formally, let k = deg(g') 
and m = deg(r), then 

c[xyp{x) 

C[x]/qix)®C[x]/r{x) (14) 
- C[x]/(a; - A) ® C[x]/{x-^j)il5) 

0<i<k 0<i<m 

^ C[x]/{x-ae). (16) 

o<e<n 

Here the (3i are the zeros of q and the jj are the zeros of r, 
i.e., both are a subset of the zeros at of p. Both steps (14) and 
(15) use the Chinese remainder theorem, whereas (16) is just 
a reordering of the spectrum. The corresponding factorization 
of the Fourier transform is provided in the following theorem. 

Theorem 1 (Cooley-Tukey Type Algorithm by Factorization) 
Let p{x) = q{x) ■ r{x), and c and be a basis of C[x]/q{x) 
and C[x]/r{x), respectively. Further, denote by /3 and 7 the 
lists of zeros of q and r, respectively. Then 

In particular, the matrix B corresponds to the base change in 
(14) mapping the basis 6 to the concatenation {c,d) of the 



bases c and d, and P is the permutation matrix mapping the 
concatenation (/3,7) to the list of zeros a in (16). 

Note that the factorization of ■p;,,^ in Theorem 1 is useful 
as a fast algorithm, i.e., reduces the arithmetic cost, only if 
B is sparse or can be multiplied with efficiently. Referring to 
Fig. 1, the "partial decomposition" is step (14). 

We consider next two examples: the DFT, which will justify 
why we refer to algorithms based on Theorem 1 as "Cooley- 
Tukey type," and then the more general case of a Vandermonde 
matrix, which is a (polynomial) Fourier transform for the 
generic finite time model. 

Example: DFT. The DFT is a (polynomial) Fourier trans- 
form for the regular signal model given by ^ = = 
C[x]/(x" - 1) with basis b = (1, x, . . . , x""!). For the 
example, we assume n = 2m and use the decomposition 
a;" - 1 = {x™ - l){x"^ + !)• Applying Theorem 1 yields 
the following decomposition steps: 



C[.t]/(.t" - 1) 

C[x]/(x™ - 1) ® C[x]/(x" + 1) 
C[x]/{x-ojI^)(B C[x]/{x 

0<i<m 

C[x]/ix-uji 



0<i<m 



(17) 

(19) 

a<e<n 

As bases in the smaller modules C[x]/{x™ — 1) and 
C[x]/{x"' + 1), we choose c = d = {l,x, . . . ,x"^-'^). We 
note that from this point on the derivation of the algorithm is 
entirely mechanical. 

First, we derive the base change matrix b corresponding to 
(17). To do so, we have to express the base elements x'^ G b 
in the basis (c, d) (concatenation); the coordinate vectors are 
the colunms of B. For < k < m, x'' is actually contained 
in c and d, so the first m colunms of B are 



B = 



where the entries * are determined next. For the base elements 
^m+k^ < fc < m, we have 

^m+k ^ ^k (x" - 1), 



X 



m+k 



-x'' mod (a;'" + 1), 



which yields the final result 
I™ I 



B = 



I. 



-I. 



= DFT2 O I„ 



Next, we consider step (18). C[x]/(x™ — 1) is decomposed 
by a DFT„, and C[x]/(x'" + 1) by a DFT-3,„ (Table III). 
Finally, the permutation in step (19) is the perfect shuffle L^, 
which interleaves the even and odd spectral components (even 
and odd exponents of a;„). The algorithm obtained is 

DFT„ = l:^,(DFT„®DFT-3„)(DFT2®I™). 

To obtain a better known form, we apply the fact that 
DFT-3„ = DFT„-D„, where = diago<i<„«) to 
get 

DFT„ = l;;,(dft„®dft„d„)(dft2®i„) 
= l;;,(I2®dft„)(i„®£)„)(dft2 0U) 
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The last expression is a radix-2 decimation-in-frequency 
Cooley-Tukey FFT; the corresponding decimation-in-time ver- 
sion is obtained by transposition using that the DFT is symmet- 
ric. The entries of the diagonal matrix ®-Dm are commonly 
called twiddle factors. 

Example: Vandermonde matrix. As a second example, 
we consider now the general case of a separable polynomial 
p{x) with zeros ak, < k < n, but keep the basis b = 
{l,x, . . . ,a;"~^) of C[x]/p{x). The associated regular signal 
model is the generic case of a finite time model, the "time- 
ness" being due to the chosen basis, see [1]. The corresponding 
(polynomial) Fourier transform is given by the Vandermonde 
matrix 

= [o^k]o<k,e<n- 

To derive a fast algorithm for JF, we assume that n = 2m. 
We choose an arbitrary factorization p{x) = q{x) ■ r{x) with 
deg{q) = dcg(r) = m and use Theorem 1 to obtain a 
factorization of the form 

T = P{Tr®T2)B, (20) 

where Tx^Ti are Vandermonde matrices for C[a;]/g(a;) and 
C[a;]/r(a;), respectively, and B has the form 



It can be shown that A and N are both a product of two 
Toeplitz matrices [41] and can thus be multiplied with using 
0(nlog(n)) operations. If n is a 2-power, then recursive 
application of (20) hence yields an 0(nlog^(n)) algorithm 
for T. 

Remarks. Theorem 1 is well-known, as it is the CRT 

for polynomials expressed in matrix form. The above DFT 
example is equivalent to the derivation in [20]. Theorem 1 is 
also used as the first step in the derivation of Winograd DFT 
algorithms [26]. There, the polynomial x" — 1 is completely 
factored over the rational numbers, and the DFT decomposed 
accordingly. 

The algorithm derivation method in Theorem 1 is always 
applicable if the basefield is C, but in general the base 
change matrix B will be dense and without useful structure. 
Otherwise, every polynomial transform would have a fast 
algorithm, which by the current state of knowledge is not the 
case. The subsequent methods are different in that respect, 
they require a special property of p(x), and only this property 
leads to the typical Cooley-Tukey FFT structure for general 
radices. 

B. Cooley-Tukey Type Algorithms: Decomposition 

A more interesting factorization of = Vh,a can be derived 
if p{x) decomposes into two polynomials, p(x) = q{r{x)). 
If deg(g') = k and deg(r) = m, then deg(p) = n = km, 
i.e., the degree of p is necessarily composite. In this case, the 
polynomial r{x) generates a subalgebra B of A = C[x]/p{x) 
consisting of all polynomials in r{x). Setting y = r{x) makes 
the structure of B evident: B = C[y]/q{y). 



Let /? = (/3o, • • • , Pk-i) be the zeros of q and let a'^ = 
{a'^Q,..., be the zeros of r{x) — /?,, < i < fc. Then 

p{x) = n 

0<i<fe 

= n n 

0<i<fe 0<j<m 

In particular, each a- ^ is a zero of p. Now we decompose 
C[x]/p{x) in the following steps: 

C[x]/p{x) ^ C[x]/q{rix)) (21) 
^ C[x]/{r{x)-(3i) (22) 

0<i<fc 

^00 C[x]/(a;-<,) (23) 

0<i<fe 0<j<m 

C[x]/{x-ae). (24) 

Q<l<n 

Steps (22) and (23) use the Chinese remainder theorem. 
To derive the corresponding factorization of 'Pb,a into four 
factors, we choose a basis c = {qq, . . . , qk-i) for C[y]/q{y), 
and for each C[x]/{r{x) — Pi) in (22) the same basis d = 
(ro, . . . , Tm-i) . Then, in the first step (21), we do not change 
A but only make a base change in A from the given basis b 
to the new basis 

b' = {roqo{r), • ■ ■ , rm-iqo{r), 

(25) 

roqk-i{r), rm-iqk-i{r)), 

which is a product of the "coarse" basis of the subalgebra 
B < A with the "fine" common basis of the C[x]/ {r{x) — Pi). 
We call B the base change matrix for b^b' . 

Next, we compute the base change matrix M corresponding 
to the coarse decomposition (22) with respect to the basis b' 
in 'C[x\/p{x) and the basis d in each summand on the right 
hand side. Let ri{x)qj{r{x)) G b'. Then 

n{x)qj{r{x)) = n{x)qj{Pi) mod {r{x) - Pi), 

which is qj(Pi) times the ^th base vector re{x) in d. Thus we 
get 

M = lljiPi) ■ Im]o<i,j<fe = ^c,/3 ® Im • 

The third step (23) decomposes the summands in (22) by 
Fourier transforms Vd,a'.^ respectively. The final step (24) 
reorders the one-dimensional summands by a suitable per- 
mutation P. We summarize the resulting factorization in the 
following theorem. 

Theorem 2 (Cooley-Tukey Type Algorithms by Decomposition) 
Let p{x) = q{r{x)). Using previous notation, 

Vb,a = ^( 'Pd.a'^ {Vc,0 ® Im)B, 

Q<i<k 

where B is the base change matrix mapping 6 to b', and P is 
the permutation matrix mapping the concatenation of the a ■ 
onto a in (24). 
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As in Theorem 1, the usefulness of the factorization as fast 
algorithm depends on the base change matrix B. Referring to 
Fig. 1, the "partial decomposition" is step (22). 

Note that the intermediate decomposition step in (22) has 
k summands, whereas the intermediate step in (14) has only 
2 summands. However, this difference is not the point, as 
Theorem 1 could be easily extended to more than 2 sum- 
mands. It is the decomposition property of p{x) that creates a 
subalgebra generated by r(x), which ensures that the conquer 
step is sparse and has the Kronecker product structure Pc,/3 <8i 
I„i, which intuitively is a "coarse" polynomial transform for 
C[x]/p{x). 

As an example we consider again the DFT. 

Example: DFT. Let A ^ M ^ C[a;]/(a;" - 1) with basis 
b= {l,x, . . . , a;"~^) be the regular signal model associated to 
the DFT„. Further, assume that n = km, which is necessary 
for decomposition. 

The polynomial p{x) = a;" — 1 then decomposes 

x" - 1 = {x'^f - 1, (26) 

i.e., p{x) = q{r{x)) with q{x) = x^ — 1 and r{x) = x™. 
Thus Theorem 2 is applicable. The zeros of q{x) are fii = u)]., 
< i < k. Using this theorem's notation, we choose c = 
(1, X, ... , x''^^) as basis in C[x]/q{x), d = (1, x, . . . , x™^-"^) 
as basis in the modules C[x]/(x'" — wl). We find that b' = b 
in (25), which implies i? = I„. 

Thus, the matrix DFT^ (g) performs the following coarse 
decomposition corresponding to (22): 

C[x]/(x"-l)- C[x]/(x™-4). 

0<i<fe 

The modules C[x]/(x™ — i^l) decomposed, respectively, 
by (7), which takes the form 

DFT^iu;l) = DFT„ ■ diag;'ro' K^' ) , 

namely as 

C[x]/{x"^-col)^ C[x]/(x-^^'=+'). 

0<j<m 

At this point, corresponding to (23), C[x]/p{x) is completely 
decomposed, but the spectrum is ordered according to jk + i, 
0<i<m, 0<j<kij runs faster). The desired order is 
im + j. Thus, we need to apply the permutation 

jk + i im + j, 

which is exactly the stride permutation in (12). 

In summary, we obtain the Cooley-Tukey decimation-in- 
frequency FFT with arbitrary radix: 

LZ, ( DFT„ • diag^^lo^ (a;jf")) (DFT^ ® I„) 

0<i<fe 

= L;^(Ifc®DFT„)T;^(DFTfc®I„), (27) 

where the matrix is diagonal and usually called the 
twiddle matrix. Transposition of (27) yields the corresponding 
decimation-in-time version. 

Again, we note that after recognizing the decomposition 
property (26), the derivation is completely mechanical. 



Remarks. Theorem 2 makes use of the CRT (in (22) 
and (23)), but it is the decomposition property of x" — 1 
that produced the typical structure. The previous work on 
the algebraic derivation of this FFT did not make use of 
decompositions. As we briefly discuss next, the decomposition 
is a special case of a more general algebraic principle. 

C. Remarks on Algebraic Principles 

The algorithms derived in this section are based on the 
factorization or decomposition of the polynomial p{x) in the 
signal model provided by C[x]/p(x) (and basis b). This is 
pleasantly simple, but it is also of interest to identify the (more 
general) principle from the representation theory of algebras 
that lies behind that. This is important, as other signal models 
may not be regular or represented by a polynomial algebra in 
one variable, but the algebraic principle may still apply. 

We focus on the decomposition property of p{x) and be 
brief, assuming some familiarity with representation theory. 
The key concept underlying Theorem 2 is induction as implicit 
in step (21). Namely, r(x) generates a subalgebra B = 
(r(x)) < A, which is isomorphic (setting y — r(x)) to 
£-[y\/ q{y). Further, d = (ro, . . . , rm-i) is a transversal of 
B in A, which means >l is a direct sum of the vector spaces 
nB: 

A = rnB® ...®r„,-iB. (28) 

This shows that the regular ^-module is an induction of the 
regular B-module with transversal d: A = B ]dA. The natural 
basis of this induction is b' in (25), which has a structure 
corresponding to (28). The purpose of step (21) is to make this 
induction explicit, and Theorem 2 is a decomposition theorem 
for inductions of (regular modules of) polynomial algebras. 

This is a satisfying insight since in prior work [42], [43] we 
derived the corresponding theorem for inductions of (modules 
of) group algebras, which has a very similar form [42, Th. 2 
in the appendix]. Further, we have shown (also in [42]) that 
at least some of the orthogonal DTT algorithms are based on 
it. Further, we have used already a different generalization of 
Theorem 2, namely to polynomial algebras in two variables 
(which provide two-dimensional signal models) to derive a 
Cooley-Tukey type algorithm in [5] for the new transform 
introduced in [4]. 

V. Cooley-Tukey Type DTT Algorithms 

(Factorization) 

In this section, we derive recursive DTT algorithms by 
applying Theorem 1, i.e., by factorizing the polynomial p in 
the module C[x]/p(x) associated to a given DTT„. To do so, 
we will use the following rational factorizations of Chebyshev 
polynomials. 

Lemma 2 The following factorizations hold for the Chebyshev 
polynomials T, U, V, W: 

i) Ta = x(4x2 - 3) 

ii) U2n-i = 2C/„-ir„. 

iii) U2n = VnWn. 

iv) Vsn+i = 2K(T2„+i - 1/2). 
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V) W3„+i = 2W„(T2„+i + 1/2). 

Proof: Follows from the closed form of the polynomials 
given in Table XXlll and trigonometric identities. ■ 
The factorizations in Lemma 2 give rise to size 3 algorithms 
for DTTs in the T-group and recursive algorithms for DTTs in 
the U-, V-, and VF-groups. These are derived in the following. 
We will not provide a cost analysis in this section, since 
most of the following algorithms are special cases of more 
general Cooley-Tukey algorithms to be introduced starting 
from Section VI. 



A. T -Group DTT Algorithms for Size 3 

We derive algorithms based on Lemma 2, i), i.e., for DTTs 
in the T-group (DTTs of type 3 and 4) of size 3. As an 
example, we consider a DCT-43. We start with the polynomial 
version DCT-43, which is a polynomial transform for C[a;]/r3 
with F-basis (Vo, Vi, V'a) = (l,2x - 1,4^2 - 2x - 1). The 
zeros of T3 are (\/|/2, 0. - V3/2). The factorization T3 = 
x{4:x'^ — 3) yields the stepwise decomposition 



C[x]/T3 

C[x]/x®C[x]/{x^ 
C[x]/x® {C[x]/{x 
C[x]/{x - ^) ® C[x]/x ® C[x]/{x 



2 



)®C[x]/{x + ^ 

2 



(29) 
0(30) 
). (31) 



We start with the base change in (29) and choose in all three 
algebras a F-basis. The base change matrix B is computed 
by mapping (Vb, Vi, V2) and expressing it in the basis on the 
right side of (29). The coordinate vectors are the colunons 
of B. The first column is (1, 1, 0)^. Because of Vi = 2x — 
1 = — 1 mod X, the second column is (—1,0,1)^. The last 
column is obtained from V2 — 4a;^ — 2a; — 1 = —1 mod x 
and \x^ -1x- \ = -Ix = -Vi + Vb mod ix^ - 3 
as (—1, 1, —1)^. Step (30) requires polynomial transforms for 
C[a;]/a; and C[x]/(x^ ~ 3/4) with F-bases, which are given 
by 



[1] and 



Vo(-V3/2) Vi(-\/3/2) 





1 Vs-i 




.1 -V3-1. 



respectively. Finally we have to exchange the first two spectral 
components in (31). The result is 



DCT-43 = 



010 
1 
001 



1 

1 
1 







1 -1 -1 

101 

01-1 



The corresponding algorithm for DCT-43 is obtained by scal- 
ing from left with diag(cos(7r/12),cos(37r/12),cos(57r/12)) 
to get 



DCT-43 = 



010 
100 
001 



/1/2 

cos(7r/12) -^172 
cos(57r/12) -JTpi 



1 -1 -1 
10 1 
1-1 



Similarly we get algorithms for the other DTTs of size 3 in 
the T-group. Those, which are among the best known ones, 
are collected in Table VIII in Section VII. 



B. U -Group DTT Algorithms 

We use Lemma 2, ii) and iii), to derive a complete set 
of recursive algorithms for DTTs that are in the i7-group, 
i.e., for all DTTs of type 1 and 2. As an example, we 
consider the DCT-2„, n = 2m, with associated module 
M = 0^x\l{x - \)Un-Y{x) and l^-basis 6 = (Vb, . . . , Vn-\)- 
From Table XXIII, the zeros of (x — l)?7„_i(a;) are given 
by afe = coskTi/n, < k < n. Using Lemma 2, ii) we 
decompose A4 in steps as 

C[x]/{x-l)U„-i 
^ C[x]/{x - l)U^-i ® C[x]/T„ (32) 

^C[x]/{x - a2k )®0C[a;]/(x-a2fe+i) (33) 

^ 0C[x]/(x-afc). (34) 

We also choose a y-basis b' = {Vq, . ■ . ,Vm-i) in both 
smaller algebras in (32); thus we know they are decomposed 
by DCT-2m and DCT-4„i, respectively. To determine the 
base change matrix B for b {b',b') we need to compute 
Vi mod {x - l)Um-i and Vi mod T^ for < i < 2m. For 
< i < m this is trivial, 

Vi = Vi mod {x - l)Um-i, Vi = Vi mod T„. 

For m < i < 2m this is precisely the signal extension of 
the two smaller algebras in (32) (see [2]). Since the signal 
extension is monomial, B is sparse. The equations are 



V 



m-\-j 



Vm-j-i mod {x - l)Um-i, and 



V,n+j = -Vm-j-i mod T,„. 
Thus, the base change matrix is given by 

n Jr 



2m 



-J. 



= (DFT2®Im)(I™®J„). (35) 



The two summands in (32) are decomposed recursively by 
DCT-2,„ and by DCT-4,„, respectively, to yield (33). Finally, 
we obtain (34) by the permutation matrix L^ (see (12)), 
which interleaves the even and odd afe. As a result, we obtain 
the well-known recursive algorithm [12]: 

DCT-2„ = L^™(DCT-2„ ® DCT-4„)B2m. 

Analogous computations for all transforms in the [/-group 
yield the full set of recursive algorithms due to Lemma 2, 
which are shown in Table Vl(a). The formulas use the follow- 
ing building blocks. The base change matrices B2m in (35) 
and 

Im (^ Jti_ 

10. (36) 

Lr) '^^'-i 

Further, they use the stride permutation matrices L^"', and the 

^2m+l 

odd stride permutation matrices l^^n+i defined in (13), which 
reorder the one-dimensional summands into the proper order. 

Note that the base change matrices i?2m and S2m+i are 
sparse in the last m columns (see (35) and (36)) because of the 
monomial signal extension characteristic for the DTTs. This 
provides another motivation for considering these extensions. 



2m+l 



12 



These four algorithms appeared first in the literature (to 
our best knowledge) in [44], [45], [12], and [46], respectively. 
Combining Table VI(a) with the many ways of translating 
DTTs into each other given by duality or base change (see 
Appendix III) gives a large number of different recursions, 
many of them, however, with suboptimal arithmetic cost. Apart 
from the references above, special cases have been derived in 
[47], [48], [49]. 

Table VI(a) is complemented by the decompositions in 
Table VI(b) which are due to Lemma 2, iii). We did not find 
these in the literature. 

As one application, we can use Table VI(b) to obtain DTT 
algorithms for small sizes, where the smaller DTTs of type 
5-8 are base cases. As a simple example, we get 



{0, . . . , 3m + 1} and is given by 



2^2, for ii =0 

12 + 2m + 1, for z'l = 1 
2i2 + 1, for ii = 2 



— ^m+l 



m+1 



I, 



m+1 



I. 



(l2™+i©Wi). 



To give a visual impression of the structure we show ^3^^ as 
an example: 



DCT-23 



L2(DCT-62eDCT-8i)B3 

Ll/2 -1 



10 
1 
10 



2 ^1 



1 1 

10 
10-1 



(37) 



Transposition yields a DCT-33 algorithm, equivalent to the 
one obtained in Section V-A. 



C. V -Group DTT Algorithms 

In this section, we derive algorithms from Lemma 2, iv), for 
all DTTs in the F-group, i.e., for all DTTs of type 7 and 8. 
Since the second factor in this factorization is T2,i+i — 1/2, the 
skew DTTs (see Section II-B) introduced in [2] come into play. 
We use 1/2 = cos7r/3. We do not give the detailed derivation, 
which is analogous to the one in the previous section, but only 
state the result in Table VI(c) using the following base change 
matrices and permutations. 



1000000000 
0000000100 
0100000000 








00000000 
00000000100 
00010000000 
00001000000 
00000000010 










1 



00000000001 

The permutation -P^™^^ leaves the last point fixed. By 
restricting P^^^ to the set {0, . . . , 3m}, we obtain and define 
the permutation P^~^^. 



D. W -Group DTT Algorithms 

The corresponding W^-group (DTTs of type 5 and 6) algo- 
rithms are given in Table VI(d) with 



B. 



(C5) 
3m+2 



1 

Im Jm 


1 

Im 




-1/2 


I2TO+I 









r(C7) 

-^3m+2 — 


l2m+l 


1/2 

Im 






Im Jm ^ 


Im 




Jm 


r(S5) 

-°3m+l — 




Im 




1 

Im Jm 


-1 

Im 


l2m+l 


Jm 
■■• 



b: 



(S7) 
3m+l 



I2 



:m+l 



^m '^m 



Im 
Jm 
■■■ 



-In 



B. 



(C6) _ 
3m+2 ~ 



d(C8) 
-°3m+l 



Im Jm 


Im 


1 


1 




Im 


l2m+l 


-2 




Jm 





Im 







l2m+l 


--- 




Im ^ Jm 




Jm 


-'^Sm+l ^ 






Im Jm 


Im 


l2m+l 







Im 


o(S8) _ 
-°3m+2 — 


l2m+l 


2 


Jm 


Im Jm 


Im 




1 


-1 



1. 



Im 
■■■ 
Jm 



and Q^"*"^ operates on {0, ... , 3m + 1} as 



Q 



3m+2 



To give the permutation, we decompose the index i into 
the radix-3 format ii + 3^2. Then P^"*"*"^ operates on the set 



i2, for ii = 0; 

= Zi -t- 3i2 i-^- ^ 2«2 + m + 1, for ii = 1; 

_ 2^2 + m + 2, for 12 = 2. 

Im+l 

l2m+l 



•r3m+2 
■^m+1 
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TABLE VI 

DTT ALGORITHMS BASED ON FACTORIZATION PROPERTIES OF THE CHEBYSHEV POLYNOMIALS. TRANSPOSITION YIELDS A DIFFERENT SET OF 
ALGORITHMS. REPLACING EACH TRANSFORM BY ITS POLYNOMIAL COUNTERPART YIELDS ALGORITHMS FOR THE POLYNOMIAL DTTS. 

(a) f7-group: Based on U2n-i = 2Un-iT„ 



DCT-l2m,+l — l^m+l (DCT-lm_|_i ® DCT-3m)-B2m+l 

DST-l2^-i = L^"'(DST-3^©DST-U_i)S2r„-i 
DCT-22™ = L^'"(DCT-2™eDCT-4^)B2^ 
DST-22,„ = L^'"(DST-4^eDST-2^)B2m 

(c) F-group: Based on Vsn+i = 2(T2„+i - 1/2)K 



DCT-73m+2 = 


p3m+2 


DST-73,n+i = 


p3m + l 

~ m 


DCT-Sa^^+i = 


p3m + l 


DST-83m.+2 = 


p3m+2 



a(C7) 
^3m+2 



(DST-32„+i(i) ffi mT-lm)Bf^l^ 



n{S8) 



(b) f7-group: Based on U2n = VnW„ 

DCT-l2^ = L''^iBCT-5meBCT-7m)B2m 
BST-hm = Li"'(DST-7^®DST-5„)S2„ 

^2m+l 

DCT-22m+l = L„+i (DCT-6m+l ® DCT-8m)-B2m+l 
DST-22m,+ l = Lm+1 (DST-8m+l ® DST-6TO,)-B2m,+ 1 

(d) -group: Based on Wsn+i = 2W„{T2„+i + 1/2) 

DCT-53^+2 = Q^^+\DCT-5m+i(BBCT-32m+i{l))BiZ% 
DST-53„+i = Q^"'+i(DST-5„eDST-32™+i(|))B:^f_|i 

DCT-63^+2 = Q^'"+'(DCT-6„+i ®DCT-42™+i(|))B<^'!^2 
DST-63^+1 = Q^™+^(DST-6^©DST-42^+i(|))b(^'':|i 



The pemutation Q^"*"^ leaves the first point fixed. By 
restricting Q^™^'^ to the points 1 < i < 3m + 1 we obtain 
and define the permutation If we rename the index 

set into {0, . . . , 3m}, we have 



Q 



:3m+l _ 



= ii + 3i2 



2i2 + m, 
2i2 + m + l, 
i2, 



for ii = 0; 
for ii = 1; 
for ii = 2. 



The usefulness of the above algorithms depends on the 
initial transform size and on the availability of algorithms for 
the occurring skew DTTs. These algorithms will be introduced 
later. 

E. Polynomial DTTs 

Every DTT in Table VI is decomposed into two DTTs that 
have the same base polynomials. Thus they have the same 
scahng function (see Table III: b and / are connected), which 
is the reason why we see no scahng factors in the equations. As 
an important consequence, we get algorithms corresponding to 
Table VI for the polynomial transforms DTT. 

As an example, we derive the polynomial equivalent of (37): 



DCT-23 



100 
001 
1 



([1-^2 



Ii 



10 1 
10 
10-1 



(38) 



where DCT-23 = diag(l,^,i) • DCT-23. The algorithm 
requires 4 additions and 1 multiplication and is thus 1 multi- 
phcation cheaper than its non-polynomial equivalent (37). 

F. Final Remarks 

The algorithms given in this section are based on Lemma 2, 
which provides factorizations of the Chebyshev polynomials 
T, U, V, W. Since all these polynomial factorizations are ra- 
tional, the associated matrix factorizations are also rational. In 
Lemma 2, ii) and iii), the factors are again Chebyshev poly- 
nomials, and thus the smaller transforms in the decomposition 
are again DTTs. In Leimna 2, iv) and v), the second factor 
^271+1 — 1/2 leads to skew DTTs (see Table V). The complete 



rational factorization of the Chebyshev polynomials T„, [/„ for 
arbitrary n is given in [50]. The rational factorization of 14, 
and Wn can be derived using [50] and Lemma 2, iii). These 
factorizations can be used to decompose a DTT, but the smaller 
transforms obtained are in general no DTTs or skew DTTs. 

All algorithms in Table VI can be manipulated in numerous 
ways using the identities in Appendix III or transposition to 
obtain different algorithms. 

vi. cooley-tukey type dtt algorithms 

(Decomposition) 

In this section, we give a first overview on DTT algorithms 
that are based on Theorem 2, i.e., on a decomposition p{x) = 
q{r{x)) of the polynomial p in the associated algebra C[x]/p. 
These algorithms are, structural and in a precise mathematical 
sense, the equivalent of the Cooley-Tukey FFT (27), which we 
derived based on the decomposition of a;" — 1 = (x™)'^ — 1. 

We will see that all 16 DTTs possess such algorithms, 
and that in many cases there are several reasonable variants 
with different characteristics to choose from. Some of these 
algorithms generahze the ones we introduced in Section V. 

Each of these "Cooley-Tukey-hke" DTT algorithms exhibits 
the same flexible recursion and regular and versatile structure 
that has been the success of the FFT. As a consequence, all 
FFT variants optimized for, e.g., parallel or vector computation 
will have counterparts for the 16 DTTs. See [32] for more 
details on FFT variants. 

Only very few special cases of these algorithms have been 
found before. Again, our algebraic methods show their power: 
the derivation using Theorem 2 is comparatively easy, since 
only base changes have to be computed; in contrast, a deriva- 
tion based on matrix entries becomes hopelessly complicated, 
and, furthermore, does not provide a guidehne to which entry 
and how manipulations should be performed to obtain an 
algorithm. 

Decomposition of Chebyshev polynomials. The DTT al- 
gorithms are based on the following lenmia, which provides 
decomposition properties of the Chebyshev polynomials. 
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Lemma 3 The Chebyshev polynomials T, U, V, W have the 
following decomposition properties: 

i) Ti^m = Tk{Tm)', Tkm — a — Ti^iTjn) — a, a £C. 

ii) Ukm-l = Um-l ■ Uk-l{Tm). 

iii) V(fe_i)/2+fcTO = Vm ■ V(fe_i)/2(T2m+i). 

iv) W^(fe-l)/2+fem = W.m ■ W^(fc_i)/2(T2„+i). 



vi) C/fc 



m+7n/2 — 1 



72-1 



Wfe(T„). 



Proof: Straightforward using the closed form of T„ from 
Table XXIII. In particular i) is well-known in the literature 
(e.g., [51]). ■ 

Inspecting the identities in Lemma 3, we observe that only 
i) provides a pure decomposition; the other identities are 
a decomposition up to a factor. Thus, in these cases, the 
algorithms derivation requires us to first apply Theorem 1 and 
then Theorem 2. 

Also, we observe that Lemma 3 provides decomposition of 
all four types of Chebyshev polynomials. Thus we can expect 
Cooley-Tukey type algorithms for all 16 DTTs. Looking at 
Lemma 3, Theorem 2, and its derivation in (21)-(24), we see 
that the algebras in (22), will all have the form 

C[a;]/(T„-cosr7r), 

and thus the decomposition (23) will require skew DTTs, 
which motivates their introduction in [2]. Of course, this poses 
the question how to further decompose the skew DTTs for non- 
trivial sizes. This question is answered by the second identity 
in Lemma 3, i): Tn — a. decomposes exactly as T„ does, which 
estabhshes a one-to-one correspondence between algorithms 
for the DTTs in the T-group and their skew counterparts. 

Fig. 2 gives an overview on the algorithms that we will 
derive from Lemma 3. We first organize the algorithms into 
the groups of DTTs (see also Table III) they apply to. In the T- 
and /7-group, we have two types of decomposition properties. 
For algorithms based on T„ = Tfc(Tm), we have three further 
degrees of freedom as will be explained later. In summary, 
each leaf of the tree in Fig. 2 represents one class consisting of 
four algorithms for each of the DTTs in the respective group. 

None of these algorithms is orthogonal, i.e., they do not 
decompose the DTTs into rotations (and butterflies). Orthog- 
onal Cooley-Tukey type algorithms are the subject of a future 
paper. 



VII. T-Group DTT Algorithms 

In this section we derive the four classes of Cooley-Tukey 
type algorithms for the four DTTs in the T-group shown in 
Fig. 2. We focus mainly on those algorithms based on T„ = 

First, we simultaneously derive the algorithms for all four 
DTTs to emphasize their common structure and their differ- 
ences. The exact form of the algorithms, i.e., all occurring 
matrices, will be derived afterwards, including a discussion 
and cost analysis in each case. 



A. Simultaneous Derivation 

We start with a fixed DTT in the T-group with associated 
algebra C[x]/Tn and C-basis b = (Co, . . . , C„_i), where 
C e {T, U, V, W} depends on flie chosen DTT. We assume 
n = km, and use the decomposition T„ = Tk{Tm)- The 
decomposition steps (21)-(24) leading to Theorem 2 take the 
form 

C[x]/T„ C[,T]/Tfc(T„) (39) 

C[x]/(T„-cos^7r) (40) 

0<i<fe 

^00 <C[x]/{x-cosn,jn) (41) 

0<i</c 0<j<m 

C[x]/(x-cos^7r), (42) 

0<i<n 

where the rj^ are determined by Lemma 1. 

In the first step (39), we change bases in C[x]/T„ = 
C[x]/Tfc (Tm), from the given C-basis h to the basis b' given in 
(25). The question arises, which basis to choose in the coarse 
algebra C[x]/Tfc, and which common basis to choose in the 
"skew" algebras C[x]/(T„ - cos(i -|- l/2)7r/fc). In the latter 
ones, we always choose the same C-basis as in the original 
algebra. For the coarse algebra, it turns out that we have two 
reasonable choices: a T-basis or a /7-basis. We consider both 
cases, starting with the J7-basis. 

[/-basis. We choose, independently of C, a [/-basis in 
C[x]/Tfe. Note, that the corresponding DTT is a DST-3„ 
(Table 111). The basis h' in (25) is then given by 



(CoC/o(Tm), • . • , Cm-lUo{Tm), 
CoUk-l{Tm), ■ ■ ■ , Cm-iUk-l{T„i)) 

{CjU,{Tm) \ 0<i<m, 0<j<k). 



(43) 



We order double indices always lexicographically = 
(0,0), (0,1),.... 

We denote the corresponding base change matrix b ^ b' 

— (*) 

in (39) with B„ f.. Here, and in the following, the "*" in the 
superscript means that the matrix depends on the DTT. It will 
later be replaced by * e {C3, 53, C4, SA} when the precise 
definitions are derived. 

After the base change, the decomposition is straightforward 
following steps (40)-(42) and Theorem 2. The coarse decom- 
position in step (40) is obtained with the matrix DST-3fc ®Irn, 
since Theorem 2 requires us to choose a polynomial transform 
for the coarse decomposition. For step (41), we need a direct 
sum of skew DTTs: 0o<,<fe DTT„((i + l/2)/fc). These are 
of the same type as the DTT we started with, since they have 
the same C-basis as the DTT to be decomposed. 

Finally, we order the one-dimensional summands in 
step (42) using a permutation. This permutation does not 
depend on the basis, but only on the zeros of Tk and Tm- 
Thus it is the same in all four cases of DTTs in the T-group, 
and, using Lemma 1, takes the form 

K^;^ = (ifeeJfeeifeeJfce...)L;^. 
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Cooley-Tukey type DTT algorithms (by decomposition) 




T-group 



[/-group 



V-group 



VmV, 



(l2m+l) 



(/c-l)/2k-'2m+l 



M^-group 



Tk{Tm) T^/2Vk{Tm) Um-\Uk-l{Tm) t^m/2-lW^fc(T'm) 




[/-basis 



as IS 



inverse-transpose 



T-group: DCT-3, DST-3, DCT-4, DST-4 

[/-group: DCT-l.DST-l,DCT-2,DST-2 

V-group: DCT-7,DST-7.DCT-8,DST-8 

W^-group: DCT-5,DST-5,DCT-6,DST-6 



Fig. 2. Overview of Cooley-Tukey type algorithms due to decomposition properties of the Chebyshev polynomials. 



This permutation is the equivalent of the stride permutation 
Lj^, occurring in the Cooley-Tukey FFT, for the DTTs in the 
T-group. 

In summary, we obtain 



DTT„ = 

K:;( DTT„ 

0<i<fc 



,(^)) (DST-3, (44) 



The question that remains is how to decompose the smaller 
transforms: the skew DTT^'s and the polynomial DST-3fc. 
However, this poses no problem. Since for any a € C, 
T„ — a decomposes exactly as T„, we derive in a completely 
analogous way the "skew version" of (44) as 



DTT„(r) = 

k;;,^ DTT„(ri))(DST:3fc(r) 



(45) 



0<i<fc 



which is a generalization of (44), which arises for r = 1/2. 
The numbers Vi are computed from r using Lemma (1). The 
matrix KJ^j neither depends on the type of DTT, nor on r; the 

matrix i?„ does depend on the type of DTT, but not on r, 
since the bases h and b' are independent of r. 

For k = n, (45) translates a DTT in the T-group into a 
DST-3, which is a special case of the translation by base 
change in Appendix III. 

Further, since DTTs and skew DTTs have the same scaUng 
function (Tables III and V), we obtain corresponding algo- 
rithms for the polynomial version of the transforms by just 
replacing each DTT by its polynomial counterpart: 



DTT„ 



K' 



0<i<fe 



DTT„(iii^))(DST-3fe(^I 



and 



DTT„(r) = 



The remaining task is to compute the exact form of ^. 
in the four cases. We will do this in Section VII-B and only 
mention at this point that in each case, Bn,k has a very sparse 
and regular structure. 

Next, we derive the analogue of the above algorithms, if a 
T-basis, instead of a [/-basis is chosen in the coarse module 
CN/Tfe. 

T-basis. In distinction with the above, we choose this 

time, independently of C, a T-basis in C[.T]/Tfe. Thus, the 
corresponding DTT is a DCT-3to- The basis h' in (25) is now 
given by 

h' = (CoTo(Tto), . . . , Cto_iTo(Tto), 

CoTfe_i(TTO), ■ ■ ■ , Cm-lTk-l{Tm)) 

= (Ci„_,/2 + Cim+jl2 I < i < m, < j < 

(46) 

using (104) in Appendix 11. We denote the base change matrix 
for&^&'byB«. 

The coarse decomposition in step (40) is now performed by 
the matrix DCT-3fe ® Im (note that DCT-3 is a polynomial 
transform). The remaining steps (41) and (42) are equal to 
what we had before. 

As a result, we obtain 



DTT„ = 

Ka(^ DTT„( 



a + l/2 - 
k . 



|(DCT-3fe(^I„)S^*[, (47) 

~0<i<fe 

and its generaUzation to the skew DTTs 
DTT„(r) = 

K;^(0 DTT„(ri))(DCT-3fe(r)®I„)BW. (48) 

0<j<fe 

Again, i?^ ^ only depends on the type of DTT, and not on r. 

The polynomial version is again given by simply replacing 
all DTTs by their polynomial counterparts: 



0<i<fe 



DTT„(r) = 

K;^ ( DTT„(r,)) (DCT-3,(r) ® I„)bW . (49) 



0<i<fe 
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We mentioned above that choosing a ?7-basis in the coarse 
module C[x\/Tk leads to base change matrices Bn,k that are 
sparse (which will be shown in detail below). For the T-basis, 
this is somewhat different. In fact, inspecting (46) shows that 
the inverse base change b' — *■ b, i.e., B~\. is sparse (with at 
most two entries in each column). For this reason, we will 
also consider the inverse of (47) and (48). 

T-basis inverted. To express the inverse, we need the 
inverse skew DTTs (Appendix HI). The inverse of (48) wiU 
take, after minor simplifications, in each case the general form 

iDTT„(r) = ((7W)-i(iDCT-3fe(r) 



( iDTT„(rO) M^, (50) 



0<i<fe 



where 



Ml 



TO"! = L^(ifceJfceifceJfce...), 



where 



and C„ ^ is closely related to B]^ ^. (50) provides algorithms 
for the DTTs of type 2 and 4 (the inverses of the DTTs in the 
T-group). 

Variants. The algorithms derived above can be further 
manipulated to obtain variants. We saw already an example: 
the inversion of (48) to obtain (50). One obvious manipulation 
is transposition, which turns each T-group DTT algorithm into 
an algorithm for a DCT or DST of type 2 or 4 (the transposes 
of the T-group DTTs). 

More interestingly, each of the above algorithms has a cor- 
responding "twiddle version," which is obtained by translating 
skew DTTs into their non-skew counterparts using (108)-(1 11) 
in Appendix III. For example, the twiddle version of (47) is 
given by 

DTT„ = 

Kl^ {ik ® DTT„) Dk,m (DCT-3fc ® lm)Bn,k , (51) 

0<i<k 

is a direct sum of the x-shaped matrices in (I08)-(III) 
(Appendix III). 

The twiddle version seems more appealing; however, we 
will later see that at least in the 2-power case n = 2^ they 
incur a higher arithmetic cost. The reason is that skew and 
non-skew DTTs can be computed with the same cost in this 
case. For other sizes, the twiddle version may not incur any 
penalty. Most state of the art software implementations [52], 
[36] fuse the twiddle factors with the subsequent loop incurred 
by the tensor product anyway to achieve better locality. 

Base cases. We provide the base cases for the above 
algorithms for size n = 2 in Table VII and for size n = 3 in 
Table VIII. The size 2 cases follow from the definition; most of 
the size 3 cases were derived in Section V-A. An exception is 
DCT-4, for which the algorithm was generated by AREP [42], 
[39], and which is a "Rader-type" algorithm (see [10]; also a 
future paper will discuss the algebraic origin of the Rader 
algorithm in detail). The DST-43 algorithm follows then by 
duality (105) (in Appendix III). 



TABLE VII 

Base cases for normal and skew T-group DTTs of size 2. 



DCT-32 = Fa • diag(l, 1/V2) 
:F2•diag(l,^/2) 

[o v/sj 




DST-32 = 
DCT-42 : 
DST-42 = 



= F2 

:F2- 



DCT-32 = DCT-32 
DST-32 = F2 • diag(l/A/2, 1) 
DCT-42 = diag(cos f ,sin f ) ■ F2 
DST-42 = diag(sin -I, cos J) • F2 ■ 




DCT-32 (r) 
DST-32 (r) : 



DCT-42 (r) 



= F2 • diag(l, cos ^7r) 

= F2-diag(l,2cos=) 
_ n -1 " 



DST-42 (r) 




DCT-32 (r) 
DST-32 (r) 



= DCT-32 (r) 



DCT-42 (r) 


= diag(cos ^ 




•F2- 




DST-42 (r) 


= diag(sini|;, 


cos^f) 


F2- 





iDCT-32(r) : 
iDST-32(r) = 

iDCT-42(r) : 
iDCT-42(r) : 



: diag(l, - — 3 

^ '2 cos 

diag{2ii^ 
1 1 - 
" 7 — ^-nr 

I COS —rr- 

1 -1^ -- 
■ 



■F2, 



2 J 



- Fa 

F2 - diag( 



2 cos z sin 



F2 ■ diag( . rii 



2 sin — 2 cos —r- 



TABLE Vm 

Base cases for normal and skew T-group DTTs of size 3. 



r 1 1 

DCT-33 =010 
Li -1 

ro 1 1 

DST-33 =10 

Lo 1 -1 



DCT-43 



DST-43 : 



1 \/3-l 

1 
Lo 1 -v^-1 

1 \/S+l 
10 
1 -v^+1 



1 1/2 
10-1 

\/3/2 

1 ~1" 

10 2 


1 -1 -1 
10 1 
1 -1 



11-1 
1 1 

oil 



DCT-33 = DCT-33 
DST-33 

DCT-43 
DST-43 





10 11 






1 






10-1. 






r 1 -1 






1 






L 1 1 






1 -1 01 






1 


[ 




1 1 J 





1/2 1 
1 0-1 
\/3/2 

■ 1 " 

1-1 
-2 -1 

'10 1,- / 
1-1 diagf 
2 1 J 



diag( 



DCT-33 (r) : 



DST-33 (r) : 





1 


1 


1 


(lie 






1 


-1 









1 





-1 








1 


1 


1 








1 


-1 





(^Il©2 




1 





-1 





, \ / l — 2r V 
COS( ^ TV) COS( ^ TV) 

os(i 

cos{ 

/ 1 
cos( — 



(1+1 

^ 3 



-tt) cos( — 
tt) cos(- 



-vr) 



t) cos( -M 



ri 1] 

10 

Lo 1 J 



DCT-43 (r) = by definition 
DST-43 (r) = by definition 



DCT-33 (r) = DCT-33 (r) 
DST-33(r) = diag(sin|7r, 
DCT-43 (r) = by definition 
DST-43 (r) = by definition 



,^7r)DST-33(r) 
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The remaining task is to derive the exact form of the 
base change matrices, which are the only parts of the above 
algorithms that depend on the DTT. We will do this in the 
remainder of this section including a cost analysis for the most 
important cases and sizes. 



B. Details: T-Group and U-Basis 



j(*) 



for 



In this section, we compute the exact form of , 

* e {(73,53,(74,54}. We derive B^^^ as an example in 
detail. The others are derived analogously and only the result 
will be presented. 

Derivation of base change matrices. We consider DCT-3. 

The matrix b[. ~ B^^,^ in (44) performs a base change in 
C\x]/Tn from a T-basis to the basis b' in (43) with C = T. 

(C3) 

To compute B/. ^ we have to express every element Tj in b 
as a linear combination in b'. To do this, we first write b as 

b = {Tim+j \ 0<i<m, 0<j<k). 

We did not change b, but only decomposed the index into 
a radix-TO representation. The double indices are ordered as 

usual lexicographically: = (0,0), (0,1) Similarly, 

we write b' as a special case of (43) 

b' = (TjU^iXm) I < « < m, < i < fc). 

First, we consider the case j = 0. From Table XXIV, we know 
that Ti = {Ui - Ui-2)/2 and thus 

Tim = Ti{Tm) = \Ui{Tm) - \Ui-2{Tm) (52) 

is the desired representation in b' . 

Now, let j ^ 0, i.e., 1 < j < m. We claim that 

Tim+j ~ TjUiiTm^ Tm—jUi—\{Tm^- (53) 

To prove it, we define the recursion 

Pi = Tj, 

Pi+l = "^TmTi — Ti-i. 

First, because of (104) (Appendix III), we see that 

Pi+l ~ Tim-\-j^ 

which is the left hand side of (53). On the other hand, using 
(103) (Appendix II) with T„i as variable, Pi+i is precisely the 
right hand side of (53), as desired. 

The equations (52) and (53) define the colunms of the base 
change matrix, which is thus given by 

_ 1 

2 



-5(C3) 



Im-1. 



(54) 



For example, all rows with an index that is a multiple of m 
are determined by (52) and thus contain the numbers 1/2. 
Using 



im+j 



CjUiiTm) ~ Cj—mUi—i{Tm), 



which generalizes (53), yields the base change matrices in the 
other three cases: 



B 



.(S3) 
k,m 



T 7 



B 



(C4) 



^k,m 



Im Jm 



Im Jm 



Exact forms of algorithms. Table IX summarizes the exact 
form of all algorithms based on (45). Each algorithm has a 
polynomial counterpart obtained by replacing the DTTs by 
DTTs. 

C. Details: T-Gwup and T-Basis 

In this section, we derive the exact base change matrices 
for the algorithms in (47), (48), and (50). 

Derivation of base change matrices. Again, we use the 
DCT-3„ as detailed example. The matrix i?^*^ = B^^^ in 
(47) performs a base change in C[a;]/T„ from a T-basis to the 
basis 

b = (Tim—j/'^ "I" Tim+j 1'^ I < i < m, < j < fc), (59) 

a special case of (46). It is not straightforward to determine 
Bj, ^ . However, the inverse base change is easy to compute 
due to the form of b': (59) already expresses the elements 
of b' as a linear combination of the elements of b. Further 
simplification in (59) is obtained for the special cases i = 
(which determines the first block of length m), namely 



T, 



~t~ Tim+j 



and for the special case j 
column), namely 



r_,/2 + T,72 = T,-, 
(which determines every mth 



Hence we get 



Tim—jl^ ~t~ Tim+j 1^ — Ti<^ 



2 * Im '^m 

T' 7 

^m 



B 



(C3)N-1 
fc.m / 



Ito 

I' 



(60) 
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TABLE IX 

COOLEY-TUKEY TYPE ALGORITHMS FOR DTTS IN THE T-GROUP, BASED ON THE DECOMPOSITION Tj.^ = Tfe (Tm ) AND A iJ-BASIS CHOSEN IN THE 
COARSE SIGNAL MODEL. THE POLYNOMIAL VERSIONS ARE OBTAINED BY REPLACING ALL TRANSFORMS BY THEIR POLYNOMIAL COUNTERPART. 

Transposition YIELDS algorithms for DCT and DST type 2 and 4. 



DCT-3„ =DCT-3„(l/2), DCT-3fc^(r) = K;^( DCT-3^(ri))DST-3fc(r-) ® U)^!^^' 

()<i<k 

DST-3n =DST-3„(l/2), DST-3fc^(r) = k;; ( DST-3^(ri)) (DST^fcW ® 

0<i<fe 

DCT-4„ = DCT-4„(l/2), DCT-4fc^(r) = Kl^{ DCT-4^(ri))DST:3fc(r) 1^)5^' 

o<«fc 

DST-4„ =DST-4„(l/2), DST-4fc^(r) = k;^( DST-4^(r-<)) (DSl^fcW ® U)^^ 

0<i<k 



(C4) 
k,m 

(S4) 



(55) 
(56) 
(57) 
(58) 



with 





1 



1 



Im, — 2 ® Im— 1 • 



All the multiphcations in (60) can be pulled out to the right 
and we get 



(C3)N,-l/p,(C3)N,-l 



with 



T 7 



(61) 



and the diagonal matrix 

) = I™ ® diag(l, 1/2, . . . , 1/2)). 

To determine S„ j, , we first analyze the block structure. 
Investigation shows that (61) consists of k blocks of size 1 
at positions jm, < j < k, and m — 1 blocks of size k 
corresponding to the index sets 

{Om + i, 2m ± i, Am zti, . . .{k — l)m ± i} (k odd), 
{Om + i, 2m ± i, 4m ± i, . . . km — i} {k even), 

for < i < m. These index sets (and thus the corresponding 
blocks) are obtained by starting at entry (?', i), < i < m, of 
(61) and collecting non-zero entries in a zigzag pattern going 
alternately to the right and down. Each of these m — 1 blocks 
has the form 



Sk = 



1 1 
Oil 



1 1 
1 



(62) 



Using the block structure, we can now write ]. as 

(cf™)"' = (Ifee5fee...eSfe)'3".^ 
= (Ifee(I„.-l05fe))«'•.^ 



with a suitable permutation Qn.fe (the precise form is not of 
importance here). Inversion yields 



(63) 



Multiphcation with the inverse of Sk, i.e., {yo, ■ ■ ■ , t/„_i)^ = 
S^^{xo, ■ ■ ■ ,Xn-i)'^ can be done with the fc — 1 recursive 
subtractions 



yn-l = Xn-1, yn-2 = Xn-2 - J/n-l, ■ ■ ■ , UO = ~ yi, 



i.e., the critical path of 5*^7^, and thus the one of C'l^^"^ has 
length fc — 1. Hence, k should be small to yield an efficient 
algorithm. For example, for k = 2, 



B^S = (Imediag(l,l/2,...,l/2))-i 
= (I„©diag(l,2,...,2)) 



'-m 



I. 



(64) 



On the other hand, C~\. in (61) has a very short critical 
path of length 1. This explains the motivation to invert (47) 
in Section VII-A to obtain (50). Doing so for the DCT-3 
considered here, it turns out that all scaling factors cancel out, 
and we obtain the beautifully simple form 

iDCT-3„(r) = (Cf,'Vi(iDCT-3fc(r) ® U) 



( iDCT-3„(rO) M^, (65) 



0<j<fe 

where 

m^ = (k;;)-i =L^(ifc©jfc ©ifc©jfc ©...). 

Equation (65) gives a class of algorithms for DCT-2„ = 
iDCT-3„(l/2), and, by transposition, we obtain again an 
algorithm for DCT-3„. 

The base change matrices for the other three T-group DTTs 
are obtained analogously. We only give the result. 
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J ri 



_r(S4) 
A; ,m 
_r{C4) 



(66) 



and 



(S3) 
2,m 

(C4) 
2,m 



(S4) 
2,m 



2T 

-J, 

2I„ 

2 It77, 



Exact forms of algorithms. Tables X and XI show the 
final algorithms, which are special cases of (48) and (50), 
respectively. 

Replacing all transforms by their polynomial counterparts 
gives the corresponding algorithms for the polynomial DTTs. 
Further, each algorithm in Tables X and XI has a correspond- 
ing twiddle version as shown in (51). 

Special cases. We briefly discuss the special case (68) for 
k = 2. (64) incurs multiplications by 2, which can be fused 
with the multiplications incurred by the adjacent DCT-32(t'). 
Namely, using DCT-32(r) = F2 • diag(l, cos §77) (Table VII), 
we can manipulate (68) to take the form 



DCT-3„(r) = K;^(DCT-3„(§) ( 



DCT-3™(2^)) 



(76) 



where 



En 



I. 



— Zrr 



COS §7r(Ii I 



32U_i). 

We also briefly consider the case fc = 3 in (68). In this case, 

Im Zrn F 



B. 



(C3) 
3,m 



(Im®(l2(»diag(l,2,...,2))) 



Ir, 



where I^ = diag(0, 1, . . . , 1). Note that this matrix, as 
mentioned before, requires only 2(m — 1) additions, since 
TO — 1 additions are duplicated (row 1, columns 2/3, and row 

(C3) 

2, colunms 2/3). However, the critical path of ^ has then 
length 2. Again, the multiplications can be fused with the 
adjacent DCT-33(r). We omit the details. 

Similarly, the multipUcations by 2 in (67) can be fused in 
(69)-(71). 

D. Alternative Decomposition 

In this section, we discuss briefly algorithms based on the 
decomposition 



The algorithms are for DTTs in the T-group and it turns 
out that the U -basis is the best choice for the coarse module 
C[x]/Vk{x). Thus, a simultaneous derivation yields, for n = 
km + m/2. 



DTT„ 



Oii(DTT„/2' 



DTT„ 



(DST-7fc . 



(*) 
k.m 



(77) 



Note that DST-7 is the polynomial transform for C[x]/Vk{x) 
with [/-basis (Table III). 

Next, we determine the best choice of size n. Inspecting (77) 
shows that, ideally, to = 2* is a 2-power and k = (3* — 1)/2 is 
the natural size for the DST-7 (explained later in Section IX). 
Thus n = 3*2*^^, which is a size that is well handled by 
the algorithms in Section Vll. For this reason, we omit the 
exact forms of the algorithms and only note that the base 
(^^) change matrices -B[*^ have structure similar to the structures 
in Sections IX and X. 

DCT, type 3 and 2, size 5. Above we established that 
ideally = (3* - l)/2, the ideal size for a DST-7. However, 
if k is small enough, namely A; = 2, the algorithm (77) is 



still useful. In particular, if m = 2, then it yields algorithms 
for sizes n = 5. We use DCT-3 as an example. It turns out 
(by trial and error) that in this case a y-basis is slightly 
superior in C[a;]/V2, and, after a minor manipulation, we 
get the algorithm in Table Xll. The cost can be read off as 
(12, 6, 1). Transposition yields an algorithm for DCT-23 with 
identical cost, which is only slightly worse than the (13, 5, 0) 
algorithm in [53]. 

E. Analysis 

In this section we analyze the algorithms in Tables IX, X, 
and XI with respect to arithmetic cost and other characteristics. 
We also review special cases that have been known in the 
literature. 

Cost analysis. Each of the algorithms in Tables IX, X, and 
XI provides reasonable algorithms with regular structure. The 
cost difference between any two of the T-group algorithms is 
0{n) where n is the transform size. We determine the cost in 
greater detail for the most relevant cases only. 

For a 2-power n, the costs in each case are independent of 
the chosen recursive split strategy. The best achieved costs are 
recorded in Table Xm. Note that the best costs for the skew 
(and inverse skew) and non-skew versions are equal since they 
have the same recursions and the base cases have equal cost 
(Table Vll). This is different for other sizes; in general the 
skew DTTs are more expensive (see also Appendix III). Also 
note that the polynomial DTTs save multiphcations (except 
for flie DCT-3 = DCT-3). 

For a 3-power n, the skew DTTs are more expensive. Also, 
the stated costs in Table XIII in this case are not the best 
possible with the algorithms in this paper. For example, we 
can sUghtly improve a DCT-3 of 3-power size n using the 
transpose of Table VI(b) to get a cost of 

(f^logsH - 2n + 2, |nlog3(n) - |r?, + ilog3(n) + |, 
> -f i log3(n) - 1) = 4nlog3(n) -ln + log3(n) + | 
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TABLE X 

COOLEY-TUKEY TYPE ALGORITHMS FOR DTTS IN THE T-GROUP, BASED ON THE DECOMPOSITION Tfc^ = Tfc (Tm ) AND A T-BASIS CHOSEN IN THE 
COARSE SIGNAL MODEL. THE POLYNOMIAL VERSIONS ARE OBTAINED BY REPLACING ALL TRANSFORMS BY THEIR POLYNOMIAL COUNTERPART. 

TRANSPOSITION YIELDS ALGORITHMS FOR DCT AND DST TYPE 2 AND 4. 



DCT-3„ = DCT-3„(l/2), 
DST-3n = DST-3n(l/2), 

DCT-4„ = DCT-4„(l/2), 
DST-4„ = DST-4„(l/2), 



DCT-3fc^(r) = K;;( DCT-3^(ri))DCT-3fc(r) ® U)^^^) 

0<i<fe 

DST-3fc^(r) = K:^( DST-3^(ri)) (DCT3fc(r) ® U)^*^^' 

0<i<fc 

DCT-4fe^(r) = K;;.( DCT-4^(ri)) DCT3fe(r) I^)^''; 
o<«fc 

DST-4fc^(r) = K» ( DST-4^(ri)) (DCT3fc(r) ® U)^^^*) 

0<i<fe 



(C4) 



(68) 
(69) 
(70) 
(71) 



TABLE XI 

Manipulated inverse of Table X: Cooley-Tukey type algorithms for the transposes of the DTTs in the T-group, based on the 

DECOMPOSITION T^m = Tfc (Tm) AND A T-BASIS CHOSEN IN THE COARSE SIGNAL MODEL. TRANSPOSITION YIELDS ALGORITHMS FOR THE DTTS IN 

THE T-GROUP. 



DCT-2„ = iDCT-3„(l/2), 
DST-2„ = iDST-3„(l/2), 

DCT-4„ = iDCT-4„(l/2), 
DST-4„ = iDST-4„(l/2), 



iDCT-3fc^(r-) = (C(^^))-i(iDCT-3fc(r) ® U)( iDCT-3^(r-i)) 

0<i<fe 

iDST-3fc^(r) = (C^f;;^))-i(iDST-3fe(r) ® U)( iDCT-3„(ri)) 

0<i<fc 

iDCT-4fc^(r-) = (C^^^')-i(iDCT-4fc(r-) ® I^)( iDCT-3^(r-i)) 

0<i<fc 

iDST-4fc^(r-) = (C(^^^)-i(iDST-4fc(r-)®U)( iDCT-3^(r-i)) 



(72) 
(73) 
(74) 
(75) 



0<i<fc 



TABLE Xn 

Algorithm FOR DCT-36 withcost(12, 6, 1). Transposition yields a DCT-26 algorithm of equal cost. 



"0 


1 








0" 











1 





1 


























1 








1 









Iie(F2diag(l,cosf)eF2diag(l,cosf )) J' f^'f^^'-I'l""'}} 
^ ' 5' °^ ' 5 l^lj diag(cos ^,2cos ^ 



"1 





-1 





1' 


1 





1/2 











1 

















1 





1 











1 






while sacrificing some regularity in structure. For example, 
for n = 9, Table XIII yields (32,12,4) = 48 and the 
above (32,11,3) = 46. The same costs apply to a DCT-2 
by transposing the algorithms. Reference [53] provides an 
(34, 8, 2) = 44 algorithm (proven optimal w.r.t. non-rational 
multiplications), with no obvious structure. Using (106), (105), 
(106), and (107), this also yields better algorithms for skew 
and non-skew DCT-4 and DST-4. 

For an arbitrary p-power n, we can compute T-group DTTs 
using the twiddle versions of the T-group algorithms (e.g., 
(51)). For example, a DCT-2pt computed with (72) requires, 
independently of the spUt strategy ^ logj,(n) DCT-2j,'s, and 

2(1 - -)n\ogJn) -2n + 2 
P 

additions and multiphcations, respectively. For a given 
DCT-2p kernel (e.g., the transpose of Table XII or [53] for 
p = 5, 7), it is now easy to compute a cost. The other T-group 
DTTs are analogous. 

Note that for a 2-power size n, the algorithms (56), (69) 
and transposed (73), for DST-3 have an 0{n) higher cost 



than a translation by duahty (105) (Appendix III). For 3-power 
sizes n, all algorithms, except those for DCT-3„ in Tables IX, 
X, and XI incur an ^n\og^(n) higher cost compared to 
translating these DTTs into a DCT-3 using 0{n) operations 
(see Appendix III). 

Further comments. 

• The algorithms in Table X have the appealing feature that 
all multiplications occur in parallel with additions on the 
same operands. Further, they are a good choice if the 
output is to be pruned, i.e., only, say, the first half of 
the output matters. This was used in [54] for DCT-2. 
However, for large k, the critical path is potentially 
prohibitive. 

< The cost of the T-group algorithms is independent of the 
chosen spht. 

• The algorithms in Table XI involve constants that are 
inverse cosines (from the base cases of the iDTTs in 
Table VII). This may cause numerical instabihty. 

• Transposition yields algorithms for the transposed DTTs 
with equal cost. The reason is that all occurring matrices 
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have this property. 

• If a non-skew DTT is decomposed using any of the T- 
group algorithms, then (the middle) one of the occurring 
skew DTTs in the direct sum has r = 1/2, i.e., is non- 
skew. 

• Any odd-size DCT of type 2 or 3 can be translated into 
an RDFT without incurring operations [53]. 

• Again we note that the algorithms in this section are 
not all available ones. In particular, there are orthogonal 
algorithms, which are due to other algebraic principles 
[42]. 

• All the algorithms have, for a 2-power size n, a total cost 
of 2nlog2(n) + 0(n). This can be improved by roughly 
5% with the recent paper [55] to ^nlog2(n) -|- 0(n), at 
the expense of some regularity. 

Literature. Algorithm (68) for 2-power size and k = 2 
was derived in [56] and in [33]; the latter also considered 
3 -powers and k = 3. For arbitrary p-powers (p prime) and 
k — p, the derivation is in [34]. The above references also used 
Chebyshev polynomials in their derivation, but they do not use 
the algebraic framework, and they present the algorithms in 
an iterative form only, which avoids the definition of skew 
DTTs. For software implementations, it is crucial to have a 
recursive form as presented here. Further, the derivation for 
p > 2 produced suboptimal cost compared to Table XTTT . 

Special cases of (68) with the reverse split, i.e., n = p*, 
k = p'^^, are not practical because of the long critical 
path for computing C„ fc. Their discovery, however, is more 
straightforward, since they do not require large skew DCTs, 
which are unexpected without knowing the underlying algebra. 
The case p = 2 was reported in [57], p = 3, 6 in [58], the case 
of a general p in [59] with examples p = 3, 5, 7, 9. 

Algorithms (69), its transpose, and the transpose of (71) 
were found, also for 2-powers and fc = 2 in [60]. The only 
special case of (70) we found in the literature was derived 
implicitly in [56], where the DCT-4 is called "odd DCT" 
and decomposed as part of a fast DCT-2 algorithm that first 
recurses using Table VI(a). 

Architecture regular versions (i.e., the equivalent to the 
Pease FFT [61]) of the algorithms in Table X, again for k = 2, 
can be found in [62], [63]. 

The only case of (72) we found in the Uterature is n = 2*, 
m = 2, in which case the skew DCTs become trivial [64]. 

All other algorithms for the T-group DTTs are to our best 
knowledge novel. 

Vlll. [/-GROUP DTT Algorithms 

We now derive Cooley-Tukey type algorithms for all four 
DTTs in the [/-group, based on the decomposition property 
(Lemma 3, ii)): 



(78) 



Since U does not decompose directly, the derivation involves 
a first additional step to factor Ukm-i into Uk-i{Tm) and 
[/„i_i using Theorem 1. In the special case of fc = 2, 
the decomposition in (78) becomes trivial and (78) becomes 
Leimna 2, ii). Thus, the algorithms in Table VI(a) become a 
special case of the algorithms derived below. 



The DTTs in the [/-group have associated modules 
'C[x]/p{x) with mutually distinct polynomials p, namely for 
DCT-1,DST-1, DCT-2, DST-2, respectively (see Table III) 

= (x2-l)[/„_2, t/„, (x-l)[/„_i, (x+l)[/„_i. (79) 

Thus, the decompositions are shghtly different and cannot be 
stated in a precise unified way as for the T-group in (VII). 
From (79), it is clear that in order to apply (78) for n = km, 
we have to consider DCT-2 and DST-2 of size n, but DCT-1 
of size n + 1 and DST-1 of size n — 1. This motivates the 
following definition, which we will use in the derivation. 



n' = < 



n+1, forDCT-l„/; 
n-1, forDST-l„/; 
n, for DCT-2„/,DST-2„ 



A. Simultaneous Derivation 

Let DTT„/ be one of the DTTs in the [/-group with module 

M = C[x]/pn' and C-basis, where C is one of T, U, V, W. 
We assume n = km. In the first step, we decompose M using 
the factorization (78) and Theorem 1 as 



C[x]/Pn' ^ C[x]/p™- ® C[,x]/[/fe_i(T™). 



(80) 



In the first summand, we choose a C-basis (i.e., equal to the 
one in ^Ay, in the second summand, we choose the basis b' 
(see (25)) given by 

b' = {CoUo{Tm), . . . , Cm-lUo{Tm) 



CoUk-2{Tm), ■ ■ ■ , Cm-lUk-2{Tm)) , 

which is required for the further decomposition of 
C[x]/Uk-i{T„i) using Theorem 2. This implies the choice 
of a [/-basis in the coarse module C[x]/Uk-i in all four 
cases. Any other choice of basis would lead to a transform 
that is not a DTT (see Table 111: only one DTT has p = 
U, namely DST-1). Also, it turns out that the base change 
matrices become more complicated for any other choice, and, 
in contrast to Section Vll, the inversion of algorithms to 
improve their structure does not work this time. 
Based on (80), we get the decomposition 



DTT„/ = Pj:*^ 

k,m 



l)m ' 



:DTTw)B 



(*) 



(81) 



where A 



(fe-i)m is a Fourier transform for C[x]/Uk-i{Tm) 
with basis b', and (•)'■*•' signifies dependency on the DTT; the 
exact form of these matrices will be provided below. Note 
that we can exchange the order of the summands in (81), if 
we properly permute the columns and rows, respectively, of 
Pfe m and Bkm- In two of the four cases, we will do this to 
obtain permutations ^ of a simpler structure. 

To apply Theorem 2 for further decomposition of A(^k-i)m7 
we need the zeros of Uk-i, which are given by cos ^, < 
i < k (TableXXIII in Appendix II), and thus 



C[x]/Uk-i{T^) 



C[x]/(T„ 

0<i<fe 



cosf) 



(82) 
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TABLE XIII 

Arithmetic costs achievable for the DTTs in the T-group with the algorithms in this paper for 2-power and 3-power size n. All 

THE 3-POWER size COSTS CAN BE SLIGHTLY IMPROVED UPON (SEE SECTION VII-E). 



S 



s 



'n-ansform 


Cost (adds, mults, 2-power mults) and total 


Achieved by 


DCT-3„ 


(|nlog2(n) -n + 1, ^nlog2(n), 0) 




(68) (see also (76)), (72)'^, Table VI(a)^ 




total: 2n log2(n) — n + 1 






DST-3„ 


same as DCT-3 




duality (105), Table VI(a)^ 


DCT-4„ 


(|nlog2(n), inlog2{)-t) + n, 0) 




(57), (70), (74), (106), and their transposes 




total: 2n log2 (n) + n poly: —n 






DST-4„ 


same as DCT-4 




(58), (71), (75), duality (105) 


DCT-3„ fr) 


same as DCT-3 




(68) 




(|nlog2(n) -n-l- 1, inlog2(n) -(- |n, 


0) 






total: 2nlog2(n) — |nH- 1 poly: — |n 




DCT-4„(r) 


same as DCT-4 




(57), (70) 


DST-4„(r) 


same as DCT-4 




(58), (71) 


DCT-3„ 


{|nlog3(n) -2n-t-2, |nlog3(n) - |n 


> 


(68), (72)'^, see also cheaper version 




total: An logg (n) -Zn + Z 




in Sprtinn VTT-F 


DST-3„ 


same as DCT-3 




duality (105) 


DCT-4„ 


(|nIog3(n) - n+ 1, ^n\og^{n) - \n, 


5" -5) 


fl06^ 




total: An log3(n) — n + 2 






DST-4„ 


same as DCT-4 




duality (105) 


DCT-3„(r) 


(|nlog3(n) -n-t-1, |nlog3(n), 0) 




(68) 




total: An log3 (n) — n + 1 






DST-3„(r) 


(|nlog3(n) - 71 -1- 1, |nlog3(ra) + inH 




(107) 




total: An logj (n) + 1 






DCT-4„(r) 


(|nlog3(n), |nlog3(n) -1- |n- i, |n 




(112) 




total: An log3(n) -|- n 






DST-4„(r) 


same as DCT-4 




equivalent to (112) 



is decomposed by DST-l/j _i (note that Theorem 2 requires 
us to choose a polynomial transform). The smaller modules in 
(82) are decomposed, respectively, by skew DTTs as 

C[a;]/(T„ - COS f ) ^ ®o<j<m(^ ' ^osnjw), (83) 

where the j and their order are computed by Lemma 1. 
The type of the skew DTT is determined by the (7-basis. For 
example, for DTT = DCT-1, C = T and thus (83) is de- 
composed by a DCT-3 (see Table III). The final factorization 
of ^(fe-i)m is given by 

0<i<fc 

(84) 

In summary, we obtain the following algorithm for a DTT^ 
in the t/-group: 

DTT„, = Pi,l(DTT„, (85) 
Ak-i)m = ( DTT:„(f))(DST:Tfc_i 01™), 

0<i<fe 

where we fused the permutations ^ in (81) and Qj^ ^ in 

(84) to a permutation ^. Equation 85 remains valid when 
the occurring DTTs are replaced by their polynomial versions 
to yield 

DTT„, = Pi;i,(DTT„, ®Ak-i)m)Bi*l, (86) 

= ( dtt1,(|))(dst:t,_i 

0<i<fe 



In the following four sections, we will give the special 
structure of the matrices ^ and Bj. ^ in all four cases. 
They are shown in Table XIV. We will analyze the arithmetic 
cost for 2-power sizes n = 2*. In all cases it turns out that, 
in contrast to the T-group algorithms derived above, the cost 
does depend on the chosen split, with the minimum obtained 
for the case k = 2, which is equivalent to Table VI(a). 
Further, the structure of the algorithm (86) shows that the 
polynomial version of the DTT requires a smaller number of 
multiplications than the DTT if this holds for the base case 
n = 2, which is easy to check. The result is that only DCT and 
DST of type 2 yield savings. For the occurring skew DTTs, 
we use the algorithms and the arithmetic cost provided in the 
previous sections. 



B. Details 

We provide the exact form of the base change matrices 

Bj^ ^ and permutations ^'^^ using the mnemonic names 
* e {Cl, SI, C2, 82} to denote the 4 DTTs in the i7-group. 

^(C1)^^(C1).^(C1) 
fc,m K,m k,m ' 
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TABLE XIV 

COOLEY-TUKEY TYPE ALGORITHMS FOR DTTS IN THE i7-GR0UP, BASED ON THE DECOMPOSITION C/fc^_l = Um-1 ■ Uk-l{Tm)- THE POLYNOMIAL 
VERSIONS ARE OBTAINED BY REPLACING ALL TRANSFORMS BY THEIR POLYNOMIAL COUNTERPART. 



DCT-lj.^_(.i 
DST-lfc^_i 



-Pilm'MDCT-U+ieC DCT-3^(i))(DST-lfc_i®U)]i?[,';^' 

^ 0<i<fc ^ 

-Pfc^m (( DST-3^(i))(DST:Tfc_i®U)eDST-U_i')B(' 

^ 0<i<fc ^ 

^ 0<i<fe ^ 

-Pfcfm ( ( DST-4^(i)) (DSTlTfc.i I^) ® S^f 



(SI) 



k.m 



,(S2) 



(87) 
(88) 
(89) 
(90) 



where Cj^^^ is given by 



1 1 

Im— 1 Jm— 1 I'm— 1 

1 1 



TABLE XV 

Base cases for [/-group DTTs. 



1 -1 

Im— 1 <Im— 1 

1 

Im— 1 Jr: 



-1 

Jm— 1 

1 -2 

Im— 1 Jm— 1 



and £) 



(ci) 

k.m 



I 



m+l ' 



fe-1 < 



)(l/2©I„_i)). 

j(Cl) 



Note that the first block row of B/. ^ represents the 

signal extension of the signal model for DCT-l„i+i (namely 
T/c mod Tm+i, k > m+l, see [2]) . Similar statements hold 
for the matrices below. 



Im— 1 Jm — 1 

1 

Im— 1 Jm— 1 

1 



?(S1). 



-1 -Jm-l 
i 



fe,m 



?(52) . 




Im— 1 ^ Jm— 1 ^ Im— 1 

Im Jm Im Jm * * * 



Im Jri 



J-m *Jm 

Im Jm 



Im Jm. 



Im J m 

Im Jn 



Im Jn 



Im Jm Im Jn 



DCT-12 = F2 

DST-li = Ii 

DCT-22 = diag(l, ^) • F2 
DST-22 = diag(4=,l)-F2 



DCT-I2 = DCT-I2 
DST-li =Ii 
DCT-22 =F2 
DST-22 = F2 



5(S2) _ 



■■ (Ifc ®(Jfc-i ® Ii) ® Ifc ®(Jfc-i ® Ii) I 



Base cases. For 2-power size, the recursions in Table XIV 

need as base cases Table XV and the skew DTTs in Table VII. 

Special cases. The recursions in Table XIV take the simplest 
form for fc = 2, in which case they coincide with Table VI(a). 



C. Alternative decomposition 

We do not discuss algorithms based on Lemma 3, vi). 
Similar statements as in Section VU-D hold. 



D. Analysis 

We analyze the algorithms in Table XIV. 

Arithmetic cost. We only consider a 2-power size n. In 
contrast to the T-group algorithms in Section VII, the cost 
of the algorithms does depend on the split. The minimum is 
obtained for = 2, in which case the algorithms coincide with 
Table VI(a). The cost in these cases is shown in Table XVI. 

Literature. Except for the case k = 2 (see Section V-B), 
we did not find any of these algorithms in the literature. 



IX. y-GROUP DTT Algorithms 

In this section, we present algorithms for all DTTs in the 
y-group, i.e., the DCT and DST of type 7 and 8, based on 
Lemma 3, iii): 



Pjf'J,^ =Ii ©((Jfc-i ® Ii) ® Ifc ®(Jfc-i ® Ii) ® Ifc . . . )Lfc-i 



d(S1). 



)(C2) . 



: (lk-1 ®(Il ® Jfc-l) ® Ifc ®(Il ® Jfc-l) ® . . . 

= (Ifc ®(Ii ® Jfc-l) ® Ifc ®(Ii ® Jfc-l) ®...)Ll, 



V{k-l)/2+km = Vm ' V(fe-i)/2(72m+i). 

Since the derivation is analogous to the previous sections, we 
wiU only state the results without a detailed derivation. 
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TABLE XVI 

Arithmetic costs for DTTs in the [/-group achievable with the recursions in Table XIV. The size of DCT-1 is n = 2*^ + 1, the size of 

DST-1 is n = 2*= - 1, AND THE sizes OF DCT-2 AND DST-2 IS n = 2*= 



TVansform Cost (adds, mults, 2-power mults) and total Achieved by 

DCT-1„ ( |n log2 (n - 1) - 2n - i logj (n - 1) + 6, Table VI(a) = (87) for fc = 2 
inlog2(n - 1) - n - i logjCn - 1) + 2, 0) 
total: 2nlog2(n — 1) — 3n — log2(n — 1) + 8 

DST-1„ ( f n log2 {n + 1) - 2n + § loga (n + 1) + 2, Table VI(a) = (88) for fe = 2 
|nlog2(n + l)-n+|log2(n + l), 0) 
total: 2n log2 (n + 1) - 3n + 3 loga (n + 1) + 2 

DCT-2„ (|nlog2(n) - n + 1, inlogaW, 0) Table VI(a) = (89) for fc = 2, 

total: 2nlog2(n) - n + 1, 0) poly: -(n - 1) (68)^ (see also (76)^), (72) 

DST-2„ same as DCT-2 Table VI(a) = (90) for k = 2, duality (105)'^ 



A. Simultaneous Derivation 

For the DTTs in the F-group (see Table III) with associated 
modules C[x]/p, the polynomial p takes (up to a constant) two 
different forms: (x+l)y„_i for DCT-7„ and DST-8„ and K 
for DST-7„ and DCT-8„. To derive all four decompositions 
simultaneously, we thus define 

,_(n+l, for DCT-7„/,DST-8„'; 
^ ~ \n, for DST-7„/,DCT-8„'. 

Now we consider DTT„' in the y-group with a (7-basis, C G 
{T, U, V, W} and assume that n = km+{k- l)/2 = (2m + 
— l)/2 + m. Necessarily, k is odd. Using Theorem 1, we 
first decompose 

C[x]/pn' = C[x]/F^(T2™+l) ® C[x]/pm'. (91) 

In the second summand, we choose a (7-basis. In the first 
summand, we choose 



b' = {CoVo{Tm),...,C^-lVo{Tm) 



CoV, 



{k-l)/2-l{Tm 



, C'm-lV(fc_i)/2_l 



as required by Theorem 2. This implies a F-basis in the coarse 

module C[x]/V(j._i)/2 ™d a (7-basis in the skew modules. 
We denote the base change matrix for (91) with -B^*^, where 
* e {(77, S7, (78, 58}. The exact form will be shown below. 

Next we decompose the second summand in (91) by 
DTTto', and the first summand using Theorem 2. The oc- 
curring skew DTTs have the same (7-basis as the given 
DTT,,/, for example DCT-7 has a T-basis and hence the 
associated skew DTT is DCT-3. We denote that skew DTT 
with DTT'. The subalgebra C[x]/V(^k-i)/2 with V-basis is, 
in all four cases, decomposed by the polynomial transform 
DST-7(fe_i)/2- The final result is the decomposition 

DTT„, = Pi%{A^2m+iKk-i)/2 © DTT„04i> 
with 

^(2m+l)(/s-l)/2 = 

( DTT'2„+i(^))(DST:7(,_i)/2®Wi). 

0<i<(fe-l)/2 



(92) 



We obtain the corresponding algorithm for the polynomial 
DTT„/ by replacing all DTTs by their polynomial counter- 
parts. Transposition of (92) yields a different set of algorithms. 



B. Details 

In this section, we record the exact form of all four classes 
of decompositions based on (92). We need the following base 
change matrices. 

For DCT-7„ and DST-8„ we require n = km+{k + l)/2. 
Then, 

?(C7) 



k.m k.m k.m ' 



with 



(C7) 



-2 


-1 






l2m. — J2m 








1 


-1 








— Jim 








1 


-1 






l2m 








1 


1 








l2m 










- Jm 



and the diagonal matrix 

4? = (I(fc-l)/2®(V2©l2j)' 



1 



m+1 ; 



ri2 



B 



(S8) 



l2m+l 



>2m+l 



l2m+l J2m+1 
l2m+l 



1 



For DST-7 and DCT-8, we require n = km+{k-l)/2. Then 



25 



TABLE XVm 
Base cases for V-group DTTs. 



DCT-72 
DST-7i 
DCT-81 
DST-82 = [ ^{ 



1 1/2 

1 -1 

2 '-I 

2 ^1 
2 1 



DCT-72 = DCT-72 
DST-7i = Ii 
DCT-81 = I2 
DCTlSa = [\\] 



l2m. J2m 








1 


1 






l2m. 


J2m 








1 


1 






l2m 


J2m 








1 








l2m 










Jm 






1 







l2m-|-l 


— J2m + 1 
l2m-|-l ~ 


J27TI-I-I 






d(C8) _ 






I2m-Hl 


— J2m-Hl 
I2m-Hl 


Im 
■■• 

— Jm 






L 




- Im ■ Jm In 





I • -Jm ■ • ■ ■ 









Correspondingly, we need two closely related types of 
permutations. Let k, m be fixed. For a given i to be mapped, 

we decompose i into the radix-fc representation i = ii + i2k, 
with ii = i mod k and 12 = i div k. Then P^^'^P = Pjf^ is 
a permutation on the set {0, . . . , km + (fc - l)/2} defined by 



d(C7) ^ ■ 

ii(2m - 
ii{2m - 



>(S8) 

k.7n 

-1)4 



= «i 

2^2, 
«2, 



(fc-l-ii)(2m + l) + 2j2 + l, 



for < 
for ii = 
for - 



fi ^ 2 ' 

— fc-i ■ 
2 ' 

i < ii < fc. 



This permutation leaves the last point fixed. By omitting 

this point, i.e., by restricting the permutation to the set 
{0, . . . , km + (fc — l)/2 — 1}, we get the permutation P^^J^ = 

p(C8) 

Base cases. The base cases for the DCTs and DSTs of type 
7 and 8 are shown in Table XVIII. The sizes are motivated 
below in the cost analysis. 

Special cases. The recursions in Table XVII take the 
simplest form for fc = 3 (note that k has to be odd), in which 
case they coincide with the algorithms in Table VI(c). 

C. Analysis 

To analyze the arithmetic cost of the algorithms in Ta- 
ble XVII, the first question is which sizes are best decom- 
posable or "natural" for these DTTs. For example, for all 
DTTs of type 1-4 the best decomposable size is 2*, with the 
exception of DCT-1, which has 2* + 1, and DST-1, which 
has 2* — 1. These sizes allow a complete decomposition into 
2x2 transforms. 



For m = 1, the decompositions in Table XVII are trivial, 
thus we obtain upon decomposition skew DTTs of odd size 
larger than 1. Hence, the best outcome is that 2m -|- 1 is a 
3-power, to allow at least a decomposition of the occurring 
skew DTTs into 3x3 transforms using Table X or IX. 
Further, we want to be able to further decompose the occurring 
DST-7(j,_i-)/2> which requires that (fc — l)/2 has again the 
form km + {k — l)/2 (with a different k). Inspection shows 
that these conditions are satisfied for n = (3' — l)/2. Namely, 
for < s < t 

(3* - l)/2 = 3'*(3*-'* - l)/2 + (3* - l)/2, 

which matches n = (2m-M)(fc-l)/2-|-mfor m = (3*-l)/2 
and k = 3*~*. In summary, the best decomposable sizes are 
thus n' = n + 1 = (3' + l)/2 for DCT-7 and DST-8, and 
n' = n = (3* - 1) /2 for DST-7 and DCT-8. This also impUes 
that the base sizes are 2 and 1, respectively. 

Using the arithmetic cost of the skew DTTs of 3-power 
size in Table XIII, we get Table XIX. Note that the costs 
for DCT-8, DST-8 are not achieved using Table VI(c) but 
only by duality (105). The reason is that Table VI(c) yields 
for these DTTs the same recurrence as for DCT-7 and 
DST-7, respectively, but with more expensive skew base cases 
(Table VIII). 

Literature. We did not find any of the algorithms in 
Table XVII in the Uterature. 

X. W^-GROUP DTT Algorithms 

We present algorithms for the DTTs in the W^-group based 
on the decomposition in Lenoma 3, iv): 

W(^k-l)/2+km = Wm ■ W(^k-l)/2{T2m+l)- 

The derivation and discussion is very similar to Section IX, 
so we will be brief. 



A. Simultaneous derivation 

We define for the DTTs in the VF-group 

, _ jn+1, for DCT-5„/ , DCT-6„/ ; 
" ~ I n, for DST-5„/ , DST-6„/ . 

The definition is motivated by the associated polynomial p in 
C[x]/p, namely p{x) = (a; - l)W„_i for DCT-5 and DCT-6 
and Wn for DST-5 and DST-6. 

Now let DTT„/ be in the W-group with C-basis. Then the 
first decomposition step yields 



C]/Pn' = C[x]/W^iT2^+i) ® C[x]/p„ 



(97) 



In the second summand we choose a C-basis. In the first 

summand we choose 

b' = {CoWo{Trr,),...,Cm-lWo{Trr,) 



CoW(^k-l)/2-l{Tm), Cm-lW(^k-l)/2-l{Tm)), 

as required by Theorem 2. This implies a T4^-basis in the coarse 
module C[x]/W(^k-i)/2 and a C-basis in the skew modules. 
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TABLE XVII 

COOLEY-TUKEY TYPE ALGORITHMS FOR DTTS IN THE y-GROUP, BASED ON THE DECOMPOSITION V(fe_i)/2+fcm = ■ V(fc_i)/2 (Tbm+l ); IS 

ODD. The polynomial versions are obtained by replacing all transforms by their polynomial counterpart. 



DCT-7 



fem+{fc + l)/2 



DST-7i.^_l_(j._i)/2 



DST-8 



fem,+ {fc + l)/2 



Pk,l{i DCT-32^+i(2i±i))(DST-7fc^ ®l2,„+i)eDCT-7„+i)B(' 



p(C7) 



p(S7) 
K,m 



(( 

^ n^,^ k-1 



DST-32r„+i(22±i))(DST-7fc^ ® W+i) eDST-7^ )B] 



)' 



,(37) 



DCT-8fc^+(fc_i)/2 = Pi'^'(( DCT-42^+i(2i±i))(DST-7^®W+i)©DCT-8^)sf^ 

^ o<i<^ ^ 

^ifm(( DST-42^+i(^))(DST:7fc_i cala^+OeDST-S^+i'ji?*^ 



(93) 
(94) 
(95) 
(96) 



0<i<-! 



TABLE XIX 

Arithmetic costs for DTTs in the F-group achievable with the recursions in Table XVII. The size for DCT-7 and DST-8 is 

n = (3* + l)/2, for DST-7 and DCT-8 is n = (3*= - l)/2. 



TVansform 


Cost (adds, mults, 2-power mults) and total 


Achieved by 




DCT-7„ 


(|nlog3(2n - 1) - 3n - 1 log3(27t - 1) + 3, 


Table Vl(c) = 


(93) for jk = 3 




|nlog3(2n - 1) - 2n - § log3(2n - 1) + 2, log3(2n - 1)) 








total: 4nlog3(2n — 1) — 5n + 5 poly: same 






DST-7„ 


(|nlog3{2n + 1) - 3n + i log3(2n + 1), 


Table Vl(c) = 


(94) for jfc = 3 




|nlog3(2n + 1) - |n + 1 log3(2n + 1), fn - i Iog3(2n + 1)) 








total: 4n Iog3 (2n + 1) - 4n + log3 (2n + 1) 






DCT-8„ 


same as DST-7 


duaHty (105) 




DST-8„ 


same as DCT-7 


duaUty (105) 





The corresponding base change matrix is 5^*^, where * G and the diagonal matrix 
{C5, S5,C6, S6}. The exact form will be shown below. 
The full decomposition becomes 

DTT„, = Pt:l{A^2m+m-i)/2 e DTT^O^ii, (98) = Wi e(I(fe_i)/2 ®(l/2 ® hm)) 

with 



At 



( DTT^„+i(^))(DST-5(,_i)/2®Wi). 



l(2m+l)(fe-l)/2 

e 

0<i<(/s-l)/2 

B. Details 

The base change matrices and permutations in (98) are as 
follows. 

d(C5) ^ ^(C5) _ jj{C5) 
k.m k.m fc.m 



111 



k,m 



2 -1 

l2m — J2m 

1 -1 
l2m — J2m 



1 -1 
l2m —32m 

1 -1 

l2m 



- It 



k,m 







- Jm '■ Im Jm i 




l2n 



J217 



l2m 



J2m 
l2m 



1 



j(C6) . 



Im Jm Im Jm 

1 1 



l2m+l — J2m+1 

l2m+l — J2m+1 



l2m+l — J2m+1 
l2m+l 
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TABLE XXI 
Base cases for W-grovp DTTs. 



DCT-52= 
DST-5i = ^ Ii 
DCT-62 = 
DST-61 



1/2 -1 

9 ^1 



= DCT-52 
Ii 

-2 J 

DCT-61 =Ii 



DCT-52 
DST-5i = 
DCT-62 = [ J / 



l2m + l J2m+1 






l2m-l-l 


J2m-t-l 






l2m+l J2m+1 


- Im 
■ ■ 
Jm - 




l2m-Hl 



d(S6) ^ 



To state the permutations, we decompose the index i to be 
mapped as before into i = ii+i2k. Then the permutations for 
for DCT-5 and DCT-6 operate on i e {0, . . . , km+{k-l)/2} 
and are given by 

p(C5) _ p(C6) _ . . , 

12, for ii = 0; 

ii{2m + 1) + 2i2 - m, for < ii < 

_ (fc - ii)(2m + 1) + 2^2 - m + 1, for ^ < ii < fc - 1. 

This permutation leaves the first point fixed. The correspond- 
ing permutation for the DSTs type 5 and 6 operates on one 
point less and arises from -P^*^^ by omitting row and 
column 0. The definition is 



' ii{2m + 1) + 2i2 + m, 
{k - ii)(2m + 1) + 2i2 - 3m - 1, 

«2, 



fc-1 . 
2 ' 



for ii < 
for ^ < ii < fc- 1; 
for zi = A: — 1. 



The final algorithms are shown in Table XX. 

Base cases. The base cases are in Table XXI. The sizes are 
motivated in the cost analysis below. 

Special cases. The algorithms in Table XX have the sim- 
plest form for fc = 3 in which case they coincide with 
Table VI(d). 

C. Analysis 

The natural sizes, i.e., those sizes that yield a decomposition 
into the smallest DTTs are n = (3* + l)/2 for DCT-5 and 
DCT-6 and n = (3* - l)/2 for DST-5 and DST-6. For these 
sizes we can achieve the cost in Table XXII. 

Literature. We did not find any of the algorithms in 
Table XX in the hterature. 

XI. Conclusions 
We presented an algebraic approach to deriving fast trans- 
form algorithms; in particular, we identified the general prin- 
ciple behind the Cooley-Tukey FFT. By applying the approach 
to the 16 DTTs, we derived equivalent "Cooley-Tukey type" 



algorithms of similar structure. Thus, we could explain many 
existing algorithms, but discovered an even larger number of 
new algorithms that could not be found with previous methods. 

The principle behind Cooley-Tukey type algorithms is poly- 
nomial decomposition for finite regular shift-invariant 1-D 
signal models (or, equivalently, polynomial algebras), or, more 
generally, induction as briefly discussed in Section IV-C. 

We hope to have achieved several things with this paper. 

The paper is a first step to obtaining a comprehensive theory 
of fast transform algorithms: a theory that classifies algorithms, 
provides insight to why they exist, illuminates their structure, 
and enables their concise, systematic derivation. 

Second, the theory of algorithms in this paper is a natural 
application of the algebraic SP theory. In [1], [2] the concept of 
signal model (as in Definition 1) was motivated as the natural 
structure underlying SP. In this paper, it becomes the key 
to derive, discover, and understand algorithms. The algebraic 
approach ties together SP theory and transform algorithm 
theory. 

Third, the paper reinforces the case for representing algo- 
rithms as structured matrices, an approach that was already 
successfully employed for the DFT in [35], [32] and occa- 
sionally in research papers (e.g., a early as [21] and more 
systematically developed and exploited in [65]). 

Fourth, by summarizing many existing and deriving many 
new algorithms, this paper can be a reference paper on 
algorithms that is useful for application developers that are 
only interested in their application. 

Future papers will derive and explain other algorithms 
available for trigonometric transforms, including real DFTs, 
orthogonal DTT algorithms, and generaUzed Rader type algo- 
rithms. 
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Appendix I 
Chinese Remainder Theorem 

Let C[x\/p{x) be a polynomial algebra (see Section II-A) 
and assume that p{x) = q{x)r{x) factors into two coprime 
polynomials, i.e., gcd(g, r) = 1. Then the Chinese remainder 
theorem (for polynomials) states that 

(/): C[x]/p{x) C[x]/q{x)®C[x]/r{x) 

s{x) I— > {s{x) mod q{x),s{x) mod r{x)) 



table xxni 

Four series of Chebyshev polynomials. The range for the 
ZEROS IS < fe < n. In the trigonometric closed form cos 6 = X. 



0,1 closed form 



symmetry 



Tn 1, X 

U„ 1, 2x 

Vn l,2x-l 

Wn l,2x + l 



cos{n6) 

3in(n-|-l)fl 
sin 6 

cos ^ 
sin(n-|-i)6l 



T_ 
U- 

V- 



--Tn 
---Un 



W-n = -Wn-l 



(fc + l)vr 

n+1 
(fe+i)7r 



(fc + l)" 



is an isomorphism of algebras. In words, C[x]/p{x) and 
C[x]/q{x) ® C[x]/r{x) have the same structure. Formally, 



+ s') 
^{s ■ s') 



= <t>{s)+ct>{s') 



which means informally that computing in C[x]/p{x) and el- 
ementwise computing in C[x]/q{x)(BC[x]/r{x) is equivalent. 



Appendix II 
Chebyshev Polynomials 

Chebyshev polynomials are a special class of orthogonal 
polynomials and play an important role in many mathematical 
areas. Excellent books are [66], [51], [67]. We only introduce 
the definitions and the properties of the polynomials we use 
in this paper. 

Let Co = 1, Ci{x) a polynomials of degree 1, and define 
Cn, n> 1 by the recurrence 

C„{x) = 2xC„_i - Cn-2{x). 

Running this recurrence backwards yields polynomials C_„, 
n > 0. Each sequence (C„)„gz of polynomials defined this 
way is called a sequence of Chebyshev polynomials. It is 
uniquely determined by the choice of Ci. Four special cases 
are of particular importance in signal processing [2], [3] and 
in this paper. They are denoted by C G {T, U, V, W} and are 
called Chebyshev polynomials of the first, second, third, and 
fourth kind. Table XXIII gives their initial conditions, their 
closed form, their symmetry properties, and their zeros. 

For example, T„(x) = cos{n9), where cos^ = x. The 
closed form easily yields the zeros of T„. 

We will use the foUowing properties of Chebyshev polyno- 
mials: 

1) For any sequence of Chebyshev polynomials with initial 
conditions Co, Ci, we have 

C„ = CiC/„_i - Co;7„_2. (103) 

2) For any sequence of Chebyshev polynomials C„, 

TkCn = {Cn-k + Cn+k)/2. (104) 

3) The identities in Table XXIV hold. They are based on 
trigonometric identities. 
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(2n-2,2n-l) 



iDCT-3„(r) 




DCT-3„(r) 



(2n-2,2n-l) 

Fig. 3. Number of additions and multiplications (adds, mults) required to translate DCTs of types 2-4 into each other for odd size n. 



TABLE XXIV 

Identities among the four series of Chebyshev polynomials; 

C„ HAS TO BE REPLACED BYT„, Vn, Wn TO OBTAIN ROWS 1,2,3,4, 

RESPECTIVELY. 







— Cn-2 


c„ 


— Cn-l 
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2(2^2 


- l)Un-2 
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l)Wn-l 


{X + l)Vn-l 


Un 




2T„ 




Vn 


Wn 


Vn 
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- l)Wn-l 
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- l)Un-l 


2Tn 


Wn 
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Appendix III 
Relationships Between DTTs 

We use in this paper the following relationships between 
DTTs. The explanation for their existence and proofs can be 
found in [2]. 

Duality. Two DTTs DTT„,DTT^, which have flipped 

boundary conditions are called dual to each other. They are 
necessarily in the same group. The duality property is not vis- 
ible from Table III since we omitted the boundary conditions. 
Thus we just state the pairs: DCT-3/DST-3, DCT-4/DST-4, 
the DTTs in the [/-group are aU self-dual, DCT-7/DST-8, 
DST-7/DCT-8, DCT-5/DCT-6, DST-5/DST-6. 
The foUowing relationship holds for dual DTTs: 



diago<fc<„((-l)') • DTT„ • J„ = DTT^ . 



(105) 



As a consequence any DTT algorithm can be converted into 
a DTT' algorithm without incurring additional operations. 

Base change. Two DTTs (or skew DTTs) in the same group 
(e.g., T-group) have (at least almost) the same associated 
algebra. As a consequence they can be translated into each 
other using a suitable base change and Table XXIV. 

Examples include: 

DCT-4„ = 5„-DCT-2„- iZ?„(l/2)-i (106) 
iDCT-4„(r) = 5„ • iDCT-3„(r) • i£)„(r)-i 

where Sn is defined in (62) and Dnif) = 
diagQ<j,^„(cos ^tt). The are computed from r using 
Lemma 1. 

Skew and non-skew DTTs. Every skew DTT(r) can be 
translated into its non-skew counterpart DTT: 



DTT„(r) = DTT„-Xi*Vr), and 
DTT„(r) = DTT„-Xi*\r). 



(107) 



Here, Xn^(r) depends on the DTT and takes the following 
forms, indicated by * e {C3, 53, C4, 54}. 



1 
ci 



si 

Cl 









s„-i 



Cn-\ 
-Sn-\ 



Cn-1 





(108) 



(109) 



(110) 



(111) 



In these equations, ce = cos(l/2 — r)£TT/n, si = sin(l/2 — 
r)£Tr/n, ^ = cos(l/2-r)(2^+l)7r/(2n), and ^ = sin(l/2- 
r){2e + l)7r/(2n). Where the diagonals cross in (108)-(111), 
the elements are added. 

Combining (107) with (106) gives, for example 

DCT-4„(r) = 5„-DCT-2„-i£»„(l/2)-i-Xp)(r). (112) 

The diagonal matrix can be fused with the x-shaped matrix to 
save multipUcations. 

Inversion of (107) gives the corresponding identities for the 

iDTT(r)'s: 

iDTT„(r) = {Xi*\r))-'- DTT^ . (113) 

The matrices (^Xn \r)) ^ have the same x-shaped structure 
and the same arithmetic complexity as Xn\r) and can 
be readily computed because of their block structure. For 
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example: 



(^PHr-)) 



-1 



■n 











c, 



'n—1 



1 



cos(l/2 — r)7r 







The above identities show that the complexity of the skew 
DTTs differ from the complexity of the DTTs by 0{n). 

Figure 3 displays the cost, as a pair (additions, multiplica- 
tions), of translating skew and non-skew DCTs of types 2-4 
into each other for odd size n. 



